🏅 Starting Info

¶

💡 Evaluation metrics for regression model:

  • Symmetric Mean Absolute Percentage Error (SMAPE): SMAPE is a metric used to measure the accuracy of a model's forecasts. It calculates the percentage difference between the predicted and actual values while considering their average. SMAPE provides a symmetric measurement that avoids the issue of asymmetry present in other metrics like MAPE.

  • Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted and actual values. It gives an idea of how wrong the predictions were, with equal weight given to all errors.

  • Mean Squared Error (MSE): MSE is similar to MAE but squares the differences before averaging them. It assigns higher weight to large errors, making it more sensitive to outliers.

  • Root Mean Squared Error (RMSE): RMSE is the square root of MSE and has the same units as the original values. It is an interpretable measure that represents the standard deviation of the unexplained variance.

  • R-squared (R2): R2 provides an indication of the goodness of fit of a model's predictions to the actual values. It measures the proportion of variance in the data explained by the model, with a value of 1 indicating a perfect fit.

  • Mean Absolute Percentage Error (MAPE): MAPE is similar to SMAPE, but it can be asymmetrical, penalizing under- or over-forecasts differently. It depends on the distribution of the actual values and can be sensitive to extreme values.

💡 Popular Regression Models:

  • Linear Regression: Linear regression models the relationship between the dependent variable and one or more independent variables by fitting a linear equation. It assumes a linear relationship between the predictors and the target variable.

    from sklearn.linear_model import LinearRegression
    
  • Polynomial Regression: Polynomial regression extends linear regression by introducing polynomial terms to capture nonlinear relationships between the predictors and the target variable.

    from sklearn.preprocessing import PolynomialFeatures
      from sklearn.linear_model import LinearRegression
    
  • Ridge Regression: Ridge regression is a regularized version of linear regression that adds a penalty term to the loss function. It helps to reduce overfitting and improve generalization by shrinking the coefficients towards zero.

    from sklearn.linear_model import Ridge
    
  • Lasso Regression: Lasso regression, similar to ridge regression, adds a penalty term to the loss function. However, it uses L1 regularization, which can perform variable selection by setting some coefficients to exactly zero.

    from sklearn.linear_model import Lasso
    
  • Elastic Net Regression: Elastic Net regression combines L1 and L2 regularization to balance the strengths of ridge and lasso regression. It can handle correlated predictors better than lasso regression alone.

    from sklearn.linear_model import ElasticNet
    
  • Decision Tree Regression: Decision tree regression models the target variable by recursively partitioning the feature space into regions based on feature values. Each partition represents a leaf node with a predicted value.

    from sklearn.tree import DecisionTreeRegressor
    
  • Random Forest Regression: Random forest regression is an ensemble technique that combines multiple decision trees. It improves prediction accuracy by averaging the predictions of individual trees.

    from sklearn.ensemble import RandomForestRegressor
    
  • Gradient Boosting Regression: Gradient boosting regression builds an ensemble of weak prediction models, such as decision trees, in a sequential manner. Each subsequent model corrects the errors made by the previous models, leading to improved predictions.

    from sklearn.ensemble import GradientBoostingRegressor
    
  • Support Vector Regression (SVR): SVR is an extension of support vector machines for regression tasks. It uses a kernel function to map the input space into a higher-dimensional feature space, allowing for nonlinear regression.

    from sklearn.svm import SVR
    
  • Neural Network Regression: Neural network regression utilizes deep learning architectures to model complex relationships between predictors and the target variable. It consists of multiple interconnected layers of artificial neurons that learn to approximate the target function.

    from sklearn.neural_network import MLPRegressor
    
  • K-Nearest Neighbors Regression (KNN): KNN regression predicts the target value based on the average of the target values of its k nearest neighbors in the feature space.

    from sklearn.neighbors import KNeighborsRegressor
    
  • Bayesian Regression: Bayesian regression applies Bayesian inference to estimate the parameters of a regression model. It provides a probabilistic framework for incorporating prior knowledge and uncertainty in the model.

    from sklearn.linear_model import BayesianRidge
    
  • Gaussian Process Regression: Gaussian process regression models the target variable as a distribution over functions. It provides a nonparametric approach to regression that captures uncertainty in predictions.

    from sklearn.gaussian_process import GaussianProcessRegressor
    
  • Generalized Linear Models (GLMs): GLMs are a broad class of regression models that unify various regression techniques, including linear regression, logistic regression, and Poisson regression. They provide a flexible framework for modeling different types of data and response variables.

    from sklearn.linear_model import TweedieRegressor  # Example of GLM
    
  • Huber Regression: Huber regression is a robust regression model that is less sensitive to outliers compared to ordinary least squares regression.

    from sklearn.linear_model import HuberRegressor
    
  • Passive Aggressive Regression: Passive Aggressive regression is an online learning algorithm that updates its model incrementally, suitable for scenarios where data arrives in streams.

    from sklearn.linear_model import PassiveAggressiveRegressor
    
  • Isotonic Regression: Isotonic regression is a nonparametric regression model that fits a monotonic function to the data.

    from sklearn.isotonic import IsotonicRegression
    
  • Orthogonal Matching Pursuit (OMP): OMP is a sparse regression model that iteratively selects relevant features and fits the model using least squares.

    from sklearn.linear_model import OrthogonalMatchingPursuit
    
  • XGBoost Regression: XGBoost is an optimized gradient boosting regression model known for its high performance and scalability.

    import xgboost as xgb
    
  • LightGBM Regression: LightGBM is another gradient boosting regression model that offers high efficiency and handles large-scale data.

    import lightgbm as lgb
    
  • CatBoost Regression: CatBoost is a gradient boosting regression model that supports categorical features and incorporates innovative techniques.

    from catboost import CatBoostRegressor
    
  • HistGradientBoosting Regression: HistGradientBoosting is a histogram-based gradient boosting regression model that provides fast and accurate predictions.

    from sklearn.experimental import enable_hist_gradient_boosting
      from sklearn.ensemble import HistGradientBoostingRegressor
    
  • ARIMA (Autoregressive Integrated Moving Average): ARIMA is a time series forecasting model that combines autoregressive and moving average components.

    from statsmodels.tsa.arima.model import ARIMA
    
  • Prophet: Prophet is a time series forecasting model developed by Facebook that captures seasonal and trend patterns.

    from prophet import Prophet
    

🛫 Imports

¶

In [ ]:
'''
--------------------------------------------------------
REGRESSION MODELS
--------------------------------------------------------
'''
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import BayesianRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from prophet import Prophet
'''
--------------------------------------------------------
CLASSIFICATION MODELS
--------------------------------------------------------
'''
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier


'''
--------------------------------------------------------
FEAUTURE ENGINEERING
--------------------------------------------------------
'''
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import plot_confusion_matrix, classification_report
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler, RobustScaler, QuantileTransformer, KBinsDiscretizer, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFromModel, RFE, RFECV, VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA, KernelPCA, NMF, TruncatedSVD, FactorAnalysis, FastICA, SparsePCA, DictionaryLearning, IncrementalPCA
'''
--------------------------------------------------------
OTHER
--------------------------------------------------------
'''
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, roc_auc_score
import category_encoders as ce
from sklearn.pipeline import Pipeline

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

from datetime import datetime
import seaborn as sns
import plotly.express as px
import pandas as pd
from functools import partial
import catboost as cb
from DataScienceMethods import Plots, Information
from category_encoders import MEstimateEncoder, GLMMEncoder, OrdinalEncoder, CatBoostEncoder
import shap
import warnings
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
import optuna
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
def ignoreWarnings():
    warnings.filterwarnings('ignore')
ignoreWarnings()

📊 Exploratory Data Analysis

¶

In [ ]:
training = pd.read_csv('Training.csv', delimiter=';')
testing = pd.read_csv('Testing.csv', delimiter=';')

#Unlock display limit
pd.set_option('display.max_columns', None)

numericColumns = ['ActualWeightFront','ActualWeightBack','ActualWeightTotal','WheelBase','Overhang']
numericColumnsNoTarget = ['WheelBase','Overhang']
targetColumns = ['ActualWeightTotal','ActualWeightFront','ActualWeightBack']

target_ActualWeightBack = training['ActualWeightBack']
target_ActualWeightFront = training['ActualWeightFront']
target_ActualWeightTotal = training['ActualWeightTotal']

📚 By the Books

Let's check to see if there are any null values

In [ ]:
#Count null values
print(training.isnull().sum())
training.head()
TruckSID              0
ActualWeightFront     0
ActualWeightBack      0
ActualWeightTotal     0
Engine                0
Transmission          0
FrontAxlePosition     0
WheelBase             0
Overhang              0
FrameRails            0
Liner                 0
FrontEndExt           0
Cab                   0
RearAxels             0
RearSusp              0
FrontSusp             0
RearWheels            0
RearTires             0
FrontWheels           0
FrontTires            0
TagAxle               0
EngineFamily          0
TransmissionFamily    0
dtype: int64
Out[ ]:
TruckSID ActualWeightFront ActualWeightBack ActualWeightTotal Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
0 31081 11280 8030 19310 1012011 2700028 3690005 249 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933469 9050015 930469 3P1998 101D100 270C25
1 30580 10720 6660 17380 1012011 2700022 3690005 183 68 403012 404002 4070004 5000004 330507 3500004 3700011 9142001 933469 9050031 930821 3P1998 101D100 270C24
2 31518 11040 6230 17270 1012001 2700022 3690005 216 68 403012 404002 4070004 5000001 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C24
3 31816 11210 7430 18640 1012002 2700028 3690005 219 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
4 30799 11910 7510 19420 1012019 2700028 3690005 231 104 403012 404002 4070004 5000001 330444 3500004 3700011 9142001 933469 9050037 930469 3P1998 101D102 270C25
In [ ]:
print(testing.isnull().sum())
testing.head()
TruckSID              0
Engine                0
Transmission          0
FrontAxlePosition     0
WheelBase             0
Overhang              0
FrameRails            0
Liner                 0
FrontEndExt           0
Cab                   0
RearAxels             0
RearSusp              0
FrontSusp             0
RearWheels            0
RearTires             0
FrontWheels           0
FrontTires            0
TagAxle               0
EngineFamily          0
TransmissionFamily    0
dtype: int64
Out[ ]:
TruckSID Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
0 35433 1012003 2700028 3690005 207 98 403012 404998 4070004 5000001 330444 3500004 3700002 9140022 933062 9050037 930469 3P1998 101D97 270C25
1 31091 1012019 2700028 3690005 201 62 403011 404002 4070004 5000004 330507 3500004 3700011 9142001 933469 9050037 930469 3P1998 101D102 270C25
2 26771 1012002 2700022 3690005 213 62 403011 404002 4070004 5000001 330444 3500004 3700002 9142001 933469 9050031 930821 3P1998 101D97 270C24
3 29201 1012003 2700022 3690005 192 110 403012 404998 4070004 5000001 3300041 3500014 3700002 9140005 933062 905549 930821 3P1998 101D97 270C24
4 31083 1012011 2700028 3690005 249 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933469 9050015 930469 3P1998 101D100 270C25

💡 Findings

There are no missing values in the training or testing dataset so we can then check to see which columns are missing from the training set

These will be our targets.

In [ ]:
print(f'''
      The shape of the training dataset is: {training.shape}
      The shape of the testing dataset is: {testing.shape}
      
      The training dataset is missing the following columns:
      
        {set( training.columns ) - set( testing.columns )}
        
      ''')
      The shape of the training dataset is: (2644, 23)
      The shape of the testing dataset is: (962, 20)
      
      The training dataset is missing the following columns:
      
        {'ActualWeightBack', 'ActualWeightFront', 'ActualWeightTotal'}
        
      

💡 Findings

This is a very small dataset so we can rule out deep learning.

In [ ]:
training.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2644 entries, 0 to 2643
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TruckSID            2644 non-null   int64 
 1   ActualWeightFront   2644 non-null   int64 
 2   ActualWeightBack    2644 non-null   int64 
 3   ActualWeightTotal   2644 non-null   int64 
 4   Engine              2644 non-null   int64 
 5   Transmission        2644 non-null   int64 
 6   FrontAxlePosition   2644 non-null   int64 
 7   WheelBase           2644 non-null   int64 
 8   Overhang            2644 non-null   int64 
 9   FrameRails          2644 non-null   int64 
 10  Liner               2644 non-null   int64 
 11  FrontEndExt         2644 non-null   int64 
 12  Cab                 2644 non-null   int64 
 13  RearAxels           2644 non-null   int64 
 14  RearSusp            2644 non-null   int64 
 15  FrontSusp           2644 non-null   int64 
 16  RearWheels          2644 non-null   int64 
 17  RearTires           2644 non-null   int64 
 18  FrontWheels         2644 non-null   int64 
 19  FrontTires          2644 non-null   int64 
 20  TagAxle             2644 non-null   object
 21  EngineFamily        2644 non-null   object
 22  TransmissionFamily  2644 non-null   object
dtypes: int64(20), object(3)
memory usage: 475.2+ KB
In [ ]:
testing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TruckSID            962 non-null    int64 
 1   Engine              962 non-null    int64 
 2   Transmission        962 non-null    int64 
 3   FrontAxlePosition   962 non-null    int64 
 4   WheelBase           962 non-null    int64 
 5   Overhang            962 non-null    int64 
 6   FrameRails          962 non-null    int64 
 7   Liner               962 non-null    int64 
 8   FrontEndExt         962 non-null    int64 
 9   Cab                 962 non-null    object
 10  RearAxels           962 non-null    int64 
 11  RearSusp            962 non-null    int64 
 12  FrontSusp           962 non-null    object
 13  RearWheels          962 non-null    int64 
 14  RearTires           962 non-null    int64 
 15  FrontWheels         962 non-null    int64 
 16  FrontTires          962 non-null    int64 
 17  TagAxle             962 non-null    object
 18  EngineFamily        962 non-null    object
 19  TransmissionFamily  962 non-null    object
dtypes: int64(15), object(5)
memory usage: 150.4+ KB
In [ ]:
'''
Take only the numeric columns for analysis
'''
training_numeric = training[numericColumns]
testing_numeric = testing[numericColumnsNoTarget]
training_numeric
Out[ ]:
ActualWeightFront ActualWeightBack ActualWeightTotal WheelBase Overhang
0 11280 8030 19310 249 104
1 10720 6660 17380 183 68
2 11040 6230 17270 216 68
3 11210 7430 18640 219 104
4 11910 7510 19420 231 104
... ... ... ... ... ...
2639 10110 9830 19940 210 104
2640 11150 6700 17850 210 74
2641 10850 7020 17870 222 80
2642 10380 6850 17230 222 56
2643 9820 8760 18580 198 104

2644 rows × 5 columns

In [ ]:
training.nunique()
Out[ ]:
TruckSID              2640
ActualWeightFront      350
ActualWeightBack       441
ActualWeightTotal      520
Engine                  12
Transmission             5
FrontAxlePosition        2
WheelBase               35
Overhang                13
FrameRails               3
Liner                    3
FrontEndExt              2
Cab                      5
RearAxels                4
RearSusp                 4
FrontSusp                4
RearWheels              10
RearTires                5
FrontWheels             11
FrontTires               4
TagAxle                  9
EngineFamily             9
TransmissionFamily       4
dtype: int64

💡 Findings

There seem to be a lot of categorical columns which are considered ints


ISSUES:


The model will assume an order to these even though they are not ordinal.
It will be important to somehow deal when this without going crazy with dummy values.
Although some tree-based and gradient-boosted tree-based models can handle high dimensionality others cannot.

In [ ]:
dataframeWithoutTarget = training.drop(columns=targetColumns)
#Get all the object columns
obj_cols_ = [col for col in training.columns if col not in numericColumns]
training[obj_cols_] = training[obj_cols_].astype('string')
testing[obj_cols_] = testing[obj_cols_].astype('string')
training.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2644 entries, 0 to 2643
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TruckSID            2644 non-null   string
 1   ActualWeightFront   2644 non-null   int64 
 2   ActualWeightBack    2644 non-null   int64 
 3   ActualWeightTotal   2644 non-null   int64 
 4   Engine              2644 non-null   string
 5   Transmission        2644 non-null   string
 6   FrontAxlePosition   2644 non-null   string
 7   WheelBase           2644 non-null   int64 
 8   Overhang            2644 non-null   int64 
 9   FrameRails          2644 non-null   string
 10  Liner               2644 non-null   string
 11  FrontEndExt         2644 non-null   string
 12  Cab                 2644 non-null   string
 13  RearAxels           2644 non-null   string
 14  RearSusp            2644 non-null   string
 15  FrontSusp           2644 non-null   string
 16  RearWheels          2644 non-null   string
 17  RearTires           2644 non-null   string
 18  FrontWheels         2644 non-null   string
 19  FrontTires          2644 non-null   string
 20  TagAxle             2644 non-null   string
 21  EngineFamily        2644 non-null   string
 22  TransmissionFamily  2644 non-null   string
dtypes: int64(5), string(18)
memory usage: 475.2 KB

💡 Findings

Cab and FrontSusp are objects in the testing data but ints in the training data.
This implies that the testing data might be dirty or hold values that do not exist in the training data.

In [ ]:
cols_in_training = training.columns.to_list()
#Exclude the TruckSID column
cols_in_testing = testing.columns.to_list()[1:] 

#Get all the columns that are not numeric
cols_in_training = [item for item in cols_in_training if item not in numericColumns]
cols_in_testing = [item for item in cols_in_testing if item not in numericColumns]


for col in cols_in_testing:
    training_values = set(training[col].unique())
    testing_values = set(testing[col].unique())

    # Find values in testing that are not in training
    diff_values = testing_values - training_values

    # Print out the different values
    for val in diff_values:
        print(f"'{val}' from '{col}' is not in the training set")
'2700023' from 'Transmission' is not in the training set
'500QXX' from 'Cab' is not in the training set
'330545' from 'RearAxels' is not in the training set
'370QXX' from 'FrontSusp' is not in the training set
'9140019' from 'RearWheels' is not in the training set
'914105' from 'RearWheels' is not in the training set

💡 Findings

There appear to be some new categories which look very suspicious. Some might be new but others are probably misspellings.
When we encode the category variables I will use


TargetEncoder(handle_unknown='value', handle_missing='value')

This will ensure that when new values it does not recognise are added it will convert them to -1 or the average

In [ ]:
training['EngineFamily'].unique()
Out[ ]:
<StringArray>
[ '101D100',   '101D97',  '101D102',  '101D97 ',  '101D97.',   '101D56',
   '101D69', '101.D100',   '101D67']
Length: 9, dtype: string

💡 Findings

There seem to be some '.' in the data, these appear to be mistakes so we will quickly get rid of them.

In [ ]:
all_cols = training.columns
for col in all_cols:
    try:
        training[col] = training[col].str.replace(' ', '')
    except:
        pass
    
training['EngineFamily'] = training['EngineFamily'].str.replace('.', '')
training['EngineFamily'].unique()
Out[ ]:
<StringArray>
['101D100', '101D97', '101D102', '101D56', '101D69', '101D67']
Length: 6, dtype: string

💡 Findings

All anomalies have been dealt with.
Let's continue looking at the data

In [ ]:
from DataScienceMethods import Information, Plots
Information.summary(training_numeric)
data shape: (2644, 5)
Out[ ]:
data type #missing %missing #unique min max first value second value third value
ActualWeightFront int64 0 0.0 350 7801.0 12890.0 11280 10720 11040
ActualWeightBack int64 0 0.0 441 4650.0 10030.0 8030 6660 6230
ActualWeightTotal int64 0 0.0 520 15721.0 20640.0 19310 17380 17270
WheelBase int64 0 0.0 35 162.0 285.0 249 183 216
Overhang int64 0 0.0 13 56.0 3618545.0 104 68 68

💡 Findings

3 million seems a bit excessive for an overhang


It is obviously an outlier so let's see if the testing data shares the same fate.

In [ ]:
Information.summary(testing_numeric)
data shape: (962, 2)
Out[ ]:
data type #missing %missing #unique min max first value second value third value
WheelBase int64 0 0.0 42 150.0 285.0 207 201 213
Overhang int64 0 0.0 14 56.0 3618545.0 98 62 62
In [ ]:
sns.boxplot(y=training_numeric['Overhang'])
plt.title('Box and Whisker Plot')
plt.show()

💡 Findings

The testing data also has the large outlier.
This means this might be a common occurence in the data.
So instead of dropping these outliers, let's rather convert them to the average (That is present without the outliers) which is 90

In [ ]:
'''
Replace all outliers with a given value (90)
'''
def replaceOutliers(data, variable, multiplier=1.5, replacement_value=90):
    Q1 = data[variable].quantile(0.25) #Specify the first quartile
    Q3 = data[variable].quantile(0.75) #Specify the third quartile
    IQR = Q3 - Q1 #Interquartile range
    
    lower_bound = Q1 - multiplier*IQR #Specify the lower bound
    upper_bound = Q3 + multiplier*IQR #Specify the upper bound
    print('Lower bound: ', lower_bound)
    print('Upper bound: ', upper_bound)
    # Replacing outliers with the given value
    data.loc[data[variable] < lower_bound, variable] = replacement_value
    data.loc[data[variable] > upper_bound, variable] = replacement_value
    return data

training_numeric = replaceOutliers(training_numeric, 'Overhang', 10)
Information.summary(training_numeric)
Lower bound:  -160.0
Upper bound:  344.0
data shape: (2644, 5)
Out[ ]:
data type #missing %missing #unique min max first value second value third value
ActualWeightFront int64 0 0.0 350 7801.0 12890.0 11280 10720 11040
ActualWeightBack int64 0 0.0 441 4650.0 10030.0 8030 6660 6230
ActualWeightTotal int64 0 0.0 520 15721.0 20640.0 19310 17380 17270
WheelBase int64 0 0.0 35 162.0 285.0 249 183 216
Overhang int64 0 0.0 12 56.0 128.0 104 68 68
In [ ]:
variables = training_numeric.columns

for col in variables:
    plt.figure(figsize=(10, 6))

    # Histogram
    sns.histplot(training_numeric[col], bins=30, kde=True)  # kde=True will also plot the kernel density estimate

    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

💡 Findings

The targets follow a normal distribution while the other two variables don't really.

In [ ]:
training = replaceOutliers(training,'Overhang', 10)
testing = replaceOutliers(testing,'Overhang', 10)
Information.summary(training)
Lower bound:  -160.0
Upper bound:  344.0
Lower bound:  -160.0
Upper bound:  344.0
data shape: (2644, 23)
Out[ ]:
data type #missing %missing #unique min max first value second value third value
TruckSID string 0 0.0 2640 NaN NaN 31081 30580 31518
ActualWeightFront int64 0 0.0 350 7801.0 12890.0 11280 10720 11040
ActualWeightBack int64 0 0.0 441 4650.0 10030.0 8030 6660 6230
ActualWeightTotal int64 0 0.0 520 15721.0 20640.0 19310 17380 17270
Engine string 0 0.0 12 NaN NaN 1012011 1012011 1012001
Transmission string 0 0.0 5 NaN NaN 2700028 2700022 2700022
FrontAxlePosition string 0 0.0 2 NaN NaN 3690005 3690005 3690005
WheelBase int64 0 0.0 35 162.0 285.0 249 183 216
Overhang int64 0 0.0 12 56.0 128.0 104 68 68
FrameRails string 0 0.0 3 NaN NaN 403012 403012 403012
Liner string 0 0.0 3 NaN NaN 404002 404002 404002
FrontEndExt string 0 0.0 2 NaN NaN 4070004 4070004 4070004
Cab string 0 0.0 5 NaN NaN 5000002 5000004 5000001
RearAxels string 0 0.0 4 NaN NaN 330444 330507 330444
RearSusp string 0 0.0 4 NaN NaN 3500004 3500004 3500004
FrontSusp string 0 0.0 4 NaN NaN 3700002 3700011 3700002
RearWheels string 0 0.0 10 NaN NaN 9140014 9142001 9140014
RearTires string 0 0.0 5 NaN NaN 933469 933469 933062
FrontWheels string 0 0.0 11 NaN NaN 9050015 9050031 9050015
FrontTires string 0 0.0 4 NaN NaN 930469 930821 930469
TagAxle string 0 0.0 7 NaN NaN 3P1998 3P1998 3P1998
EngineFamily string 0 0.0 6 NaN NaN 101D100 101D100 101D97
TransmissionFamily string 0 0.0 2 NaN NaN 270C25 270C24 270C24
In [ ]:
Information.summary(testing)
data shape: (962, 20)
Out[ ]:
data type #missing %missing #unique min max first value second value third value
TruckSID string 0 0.0 962 NaN NaN 35433 31091 26771
Engine string 0 0.0 12 NaN NaN 1012003 1012019 1012002
Transmission string 0 0.0 6 NaN NaN 2700028 2700028 2700022
FrontAxlePosition string 0 0.0 2 NaN NaN 3690005 3690005 3690005
WheelBase int64 0 0.0 42 150.0 285.0 207 201 213
Overhang int64 0 0.0 13 56.0 128.0 98 62 62
FrameRails string 0 0.0 3 NaN NaN 403012 403011 403011
Liner string 0 0.0 3 NaN NaN 404998 404002 404002
FrontEndExt string 0 0.0 2 NaN NaN 4070004 4070004 4070004
Cab string 0 0.0 6 NaN NaN 5000001 5000004 5000001
RearAxels string 0 0.0 5 NaN NaN 330444 330507 330444
RearSusp string 0 0.0 4 NaN NaN 3500004 3500004 3500004
FrontSusp string 0 0.0 5 NaN NaN 3700002 3700011 3700002
RearWheels string 0 0.0 12 NaN NaN 9140022 9142001 9142001
RearTires string 0 0.0 5 NaN NaN 933062 933469 933469
FrontWheels string 0 0.0 11 NaN NaN 9050037 9050037 9050031
FrontTires string 0 0.0 4 NaN NaN 930469 930469 930821
TagAxle string 0 0.0 6 NaN NaN 3P1998 3P1998 3P1998
EngineFamily string 0 0.0 6 NaN NaN 101D97 101D102 101D97
TransmissionFamily string 0 0.0 2 NaN NaN 270C25 270C25 270C24
In [ ]:
sns.boxplot(y=training_numeric['Overhang'])
plt.title('Box and Whisker Plot')
plt.show()

💡 Findings

The data is looking much better.
But we will need to keep in mind that we want to deal with the outliers in the testing data at some point.

In [ ]:
training.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2644 entries, 0 to 2643
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TruckSID            2644 non-null   string
 1   ActualWeightFront   2644 non-null   int64 
 2   ActualWeightBack    2644 non-null   int64 
 3   ActualWeightTotal   2644 non-null   int64 
 4   Engine              2644 non-null   string
 5   Transmission        2644 non-null   string
 6   FrontAxlePosition   2644 non-null   string
 7   WheelBase           2644 non-null   int64 
 8   Overhang            2644 non-null   int64 
 9   FrameRails          2644 non-null   string
 10  Liner               2644 non-null   string
 11  FrontEndExt         2644 non-null   string
 12  Cab                 2644 non-null   string
 13  RearAxels           2644 non-null   string
 14  RearSusp            2644 non-null   string
 15  FrontSusp           2644 non-null   string
 16  RearWheels          2644 non-null   string
 17  RearTires           2644 non-null   string
 18  FrontWheels         2644 non-null   string
 19  FrontTires          2644 non-null   string
 20  TagAxle             2644 non-null   string
 21  EngineFamily        2644 non-null   string
 22  TransmissionFamily  2644 non-null   string
dtypes: int64(5), string(18)
memory usage: 475.2 KB
In [ ]:
training.to_csv('training_super_clean2.csv', index=False)

📚 By the Books

Let's check the correlations between the numeric features and the target variables.

In [ ]:
# Create a mask for the upper triangle.
mask = np.triu(np.ones_like(training_numeric.corr(), dtype=bool))
# Set up the matplotlib figure size.
sns.set(rc={'figure.figsize':(20,10)})
# Draw the heatmap with the mask.
sns.heatmap(training_numeric.corr(), annot=True, mask=mask)
Out[ ]:
<AxesSubplot:>

💡 Findings

We see a slight positive correlation between Back and total from overhang, but no variables stand out as heroes.

🧠 A good idea

Let's check for Heteroskedasticity

In [ ]:
def checkForHeteroskedasticity(dependentNames,dependentVariables,independentNames,independentVariables):
    for y_name, y in zip(dependentNames, dependentVariables):
        for X_name, X in zip(independentNames, independentVariables):
            X = sm.add_constant(X)
            print(f'{y_name} - {X_name}')
            # Fit regression model
            model = sm.OLS(y, X).fit()

            # Perform Breusch-Pagan test
            bp_test = het_breuschpagan(model.resid, model.model.exog)
            labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
            results = dict(zip(labels, bp_test))

            print(results)

            alpha = 0.05
            if results['LM-Test p-value'] < alpha or results['F-Test p-value'] < alpha:
                print("Reject the null hypothesis: Evidence of heteroskedasticity.")
            else:
                print("Fail to reject the null hypothesis: No evidence of heteroskedasticity.")
            print('')

target_ActualWeightTotal = training_numeric['ActualWeightTotal']
target_ActualWeightFront = training_numeric['ActualWeightFront']
target_ActualWeightBack = training_numeric['ActualWeightBack']
X_WheelBase = training_numeric['WheelBase']
X_Overhang = training_numeric['Overhang']
X_All = training_numeric.drop(['ActualWeightTotal', 'ActualWeightFront', 'ActualWeightBack'], axis=1)

dependentVariables = [target_ActualWeightTotal, target_ActualWeightFront, target_ActualWeightBack]
independentVariables = [X_WheelBase, X_Overhang, X_All]

dependentNames = ['ActualWeightTotal', 'ActualWeightFront', 'ActualWeightBack']
independentNames = ['WheelBase', 'Overhang', 'All']
    
checkForHeteroskedasticity(dependentNames,dependentVariables,independentNames,independentVariables) 
ActualWeightTotal - WheelBase
{'LM Statistic': 8.762936123953516, 'LM-Test p-value': 0.003074137501122203, 'F-Statistic': 8.785424870061815, 'F-Test p-value': 0.0030636132978057976}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightTotal - Overhang
{'LM Statistic': 3.6970266508277803, 'LM-Test p-value': 0.054509523567570786, 'F-Statistic': 3.6994028753816743, 'F-Test p-value': 0.054539291785276965}
Fail to reject the null hypothesis: No evidence of heteroskedasticity.

ActualWeightTotal - All
{'LM Statistic': 11.745150415183469, 'LM-Test p-value': 0.0028156132013616875, 'F-Statistic': 5.892085686800411, 'F-Test p-value': 0.0027976391354201393}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightFront - WheelBase
{'LM Statistic': 32.94340347638752, 'LM-Test p-value': 9.488113756834247e-09, 'F-Statistic': 33.333812871194404, 'F-Test p-value': 8.668664400292033e-09}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightFront - Overhang
{'LM Statistic': 54.98825995311089, 'LM-Test p-value': 1.21251928235889e-13, 'F-Statistic': 56.11368251017964, 'F-Test p-value': 9.27204061053086e-14}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightFront - All
{'LM Statistic': 49.82460261088168, 'LM-Test p-value': 1.516090098971871e-11, 'F-Statistic': 25.361965815374816, 'F-Test p-value': 1.2299224575118543e-11}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightBack - WheelBase
{'LM Statistic': 68.88135782553877, 'LM-Test p-value': 1.0456862398888746e-16, 'F-Statistic': 70.67035452059922, 'F-Test p-value': 6.805439212303132e-17}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightBack - Overhang
{'LM Statistic': 140.58584127260607, 'LM-Test p-value': 1.9820090908421064e-32, 'F-Statistic': 148.36849561921454, 'F-Test p-value': 3.0167609854688856e-33}
Reject the null hypothesis: Evidence of heteroskedasticity.

ActualWeightBack - All
{'LM Statistic': 134.89977627214296, 'LM-Test p-value': 5.091969310887176e-30, 'F-Statistic': 70.9956313752597, 'F-Test p-value': 9.274388741601217e-31}
Reject the null hypothesis: Evidence of heteroskedasticity.

💡 Findings


ActualWeightTotal - Overhang:
No evidence of heteroskedasticity. Both p-values are greater than 0.05, so we fail to reject the null hypothesis.


The Rest
Evidence of heteroskedasticity


When using "Overhang" to predict "ActualWeightTotal", the model's errors seem to have a constant variance, which is a good thing. It means the model is probably well-specified for this relationship, at least in terms of the assumption of homoskedasticity (constant variance of errors).

However, when "Overhang" is used to predict either "ActualWeightFront" or "ActualWeightBack", there is evidence that the variance of the model's errors is not constant. This can be problematic as it can lead to unreliable significance tests and confidence intervals.


Applying a log function actually made it perform slightly worse

In [ ]:
def threeDScatter(df, category,X,Y,Z):
    # Create a color palette from Seaborn
   
    unique_categories = df[category].unique()
    palette = dict(zip(unique_categories, sns.color_palette("husl", len(unique_categories))))
    colors = df[category].map(palette)

    # Set the Seaborn style
    sns.set_style("whitegrid")

    fig = plt.figure(figsize=(12, 12))
    ax = fig.add_subplot(111, projection='3d')
    scatter = ax.scatter(df[X], df[Y], df[Z], c=colors, s=60, edgecolors='w', depthshade=True)

    # Setting the labels for the axes
    ax.set_title(f'{X} vs {Y} vs {Z} for {category}')
    ax.set_xlabel(X)
    ax.set_ylabel(Y)
    ax.set_zlabel(Z)

    # Add legend
    from matplotlib.lines import Line2D
    legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=palette[key], markersize=10, label=key) for key in palette]
    ax.legend(handles=legend_elements, loc='upper right')

    plt.show()
    
threeDScatter(training,'EngineFamily','ActualWeightFront','ActualWeightBack','ActualWeightTotal')
threeDScatter(training,'Engine','ActualWeightFront','ActualWeightBack','ActualWeightTotal')
threeDScatter(training,'Transmission','ActualWeightFront','ActualWeightBack','ActualWeightTotal')
threeDScatter(training,'TransmissionFamily','ActualWeightFront','ActualWeightBack','ActualWeightTotal')

💡 Findings

All three features seem to show decent grouping within the data.

🔨 Data Preparation

¶

📚 By the Book

Let's define our X and y value

In [ ]:
X = training.drop(columns= targetColumns  + ['TruckSID'])
y = target_ActualWeightFront
X
Out[ ]:
Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
0 1012011 2700028 3690005 249 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933469 9050015 930469 3P1998 101D100 270C25
1 1012011 2700022 3690005 183 68 403012 404002 4070004 5000004 330507 3500004 3700011 9142001 933469 9050031 930821 3P1998 101D100 270C24
2 1012001 2700022 3690005 216 68 403012 404002 4070004 5000001 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C24
3 1012002 2700028 3690005 219 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
4 1012019 2700028 3690005 231 104 403012 404002 4070004 5000001 330444 3500004 3700011 9142001 933469 9050037 930469 3P1998 101D102 270C25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 1012012 2700024 3690005 210 104 403012 404998 4070004 5000002 3300041 3500003 3700002 9140016 933469 9050015 930469 3P1998 101D100 270C24
2640 1012002 2700028 3690005 210 74 403012 404002 4070004 5000003 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
2641 1012002 2700028 3690005 222 80 403012 404998 4070004 5000002 330444 3500014 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
2642 1012011 2700022 3690005 222 56 403012 404002 4070004 5000001 330444 3500004 3700002 9142003 933469 9052003 930469 3P1998 101D100 270C24
2643 1012011 2700022 3690005 198 104 403012 404998 4070004 5000002 330507 3500004 3700002 9142001 933469 9050037 930469 3P1998 101D100 270C24

2644 rows × 19 columns

📚 By the Book

Let's define a pipeline to process the data

In [ ]:
# Custom transformer to convert string columns to category
class StringToCategory(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        string_cols = X.select_dtypes(include=['string']).columns
        X[string_cols] = X[string_cols].astype('category')
        return X

#Scale the data
class DataFrameScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()
        self.columns = None

    def fit(self, X, y=None):
        self.scaler.fit(X, y)
        self.columns = X.columns
        return self

    def transform(self, X):
        X_scaled = self.scaler.transform(X)
        return pd.DataFrame(X_scaled, columns=self.columns)
    
# Create the pipeline
data_prep_pipeline = Pipeline([
    ('str_to_cat', StringToCategory()),
    ('target_encode', ce.TargetEncoder()),
    ('scale', DataFrameScaler())
])
In [ ]:
X_transformed = data_prep_pipeline.fit_transform(X, y)
X_transformed
Out[ ]:
Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
0 -0.775150 1.199613 0.085077 2.620206 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 1.059240
1 -0.775150 -0.917003 0.085077 -1.545882 -1.391651 0.647233 1.118745 -0.061616 1.193472 -0.427545 0.251651 -0.687423 -0.317981 -0.240839 0.801290 1.816748 -0.15539 -0.774318 -0.944073
2 0.194674 -0.917003 0.085077 0.537162 -1.391651 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 -0.944073
3 0.270655 1.199613 0.085077 0.726530 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240
4 1.580971 1.199613 0.085077 1.484000 0.867729 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 -0.687423 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 1.551007 1.059240
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 -0.175652 -0.231394 0.085077 0.158427 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -1.712860 -3.511279 0.402826 1.691276 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 -0.944073
2640 0.270655 1.199613 0.085077 0.158427 -1.015088 0.647233 1.118745 -0.061616 2.089485 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240
2641 0.270655 1.199613 0.085077 0.915898 -0.638524 0.647233 -0.906376 -0.061616 -0.946623 0.928864 0.517296 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240
2642 -0.775150 -0.917003 0.085077 0.915898 -2.144777 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 -0.721544 -0.240839 -1.037440 -0.359090 -0.15539 -0.774318 -0.944073
2643 -0.775150 -0.917003 0.085077 -0.599043 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -0.427545 0.251651 0.402826 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 -0.774318 -0.944073

2644 rows × 19 columns

In [ ]:
newData = pd.concat([X_transformed, target_ActualWeightBack], axis=1)
newData = pd.concat([newData, target_ActualWeightFront], axis=1)
newData = pd.concat([newData, target_ActualWeightTotal], axis=1)
newData
Out[ ]:
Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily ActualWeightBack ActualWeightFront ActualWeightTotal
0 -0.775150 1.199613 0.085077 2.620206 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 1.059240 8030 11280 19310
1 -0.775150 -0.917003 0.085077 -1.545882 -1.391651 0.647233 1.118745 -0.061616 1.193472 -0.427545 0.251651 -0.687423 -0.317981 -0.240839 0.801290 1.816748 -0.15539 -0.774318 -0.944073 6660 10720 17380
2 0.194674 -0.917003 0.085077 0.537162 -1.391651 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 -0.944073 6230 11040 17270
3 0.270655 1.199613 0.085077 0.726530 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 7430 11210 18640
4 1.580971 1.199613 0.085077 1.484000 0.867729 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 -0.687423 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 1.551007 1.059240 7510 11910 19420
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 -0.175652 -0.231394 0.085077 0.158427 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -1.712860 -3.511279 0.402826 1.691276 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 -0.944073 9830 10110 19940
2640 0.270655 1.199613 0.085077 0.158427 -1.015088 0.647233 1.118745 -0.061616 2.089485 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 6700 11150 17850
2641 0.270655 1.199613 0.085077 0.915898 -0.638524 0.647233 -0.906376 -0.061616 -0.946623 0.928864 0.517296 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 7020 10850 17870
2642 -0.775150 -0.917003 0.085077 0.915898 -2.144777 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 -0.721544 -0.240839 -1.037440 -0.359090 -0.15539 -0.774318 -0.944073 6850 10380 17230
2643 -0.775150 -0.917003 0.085077 -0.599043 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -0.427545 0.251651 0.402826 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 -0.774318 -0.944073 8760 9820 18580

2644 rows × 22 columns

In [ ]:
mask = np.triu(np.ones_like(newData.corr(), dtype=bool))
# Set up the matplotlib figure size.
sns.set(rc={'figure.figsize':(20,10)})
# Draw the heatmap with the mask.
sns.heatmap(newData.corr(), annot=True, mask=mask)
Out[ ]:
<AxesSubplot:>

🏆 Model Selection

¶

🧠 A good idea

Let's see if we can get a good idea of how many estimators we should use as a reference.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, target_ActualWeightFront, test_size=0.3, random_state=42)
# Number of trees to evaluate
n_trees = list(range(10, 201, 10))
oob_errors = []

for n in n_trees:
    model = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    oob_errors.append(1 - model.oob_score_)

plt.figure(figsize=(12, 6))
plt.plot(n_trees, oob_errors, '-o')
plt.title('OOB Error Across Different Numbers of Trees')
plt.xlabel('Number of Trees')
plt.ylabel('OOB Error')
plt.show()

💡 Findings

When testing estimators let's use 30 as a point of reference.


Using this information let's try and get an idea of which models will work really well with this data.

In [ ]:
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import ElasticNet, BayesianRidge, ARDRegression, PassiveAggressiveRegressor, HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.svm import LinearSVR
from sklearn.kernel_ridge import KernelRidge
import pandas as pd
from sklearn.model_selection import KFold

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
import numpy as np

#Add medals to the best scoring algorithms
def append_medals(df):
    # For R2, higher is better
    top3_r2 = df['R2 Score'].nlargest(3).index
    df.at[top3_r2[0], 'R2 Score'] = str(df['R2 Score'].iloc[top3_r2[0]]) + ' 🏆'
    df.at[top3_r2[1], 'R2 Score'] = str(df['R2 Score'].iloc[top3_r2[1]]) + ' 🥈'
    df.at[top3_r2[2], 'R2 Score'] = str(df['R2 Score'].iloc[top3_r2[2]]) + ' 🥉'
    
    # For the rest, lower is better
    metrics = ['RMSE', 'MSE', 'MAE', 'MAPE', 'SMAPE']
    for metric in metrics:
        top3_indices = df[metric].nsmallest(3).index
        df.at[top3_indices[0], metric] = str(df[metric].iloc[top3_indices[0]]) + ' 🏆'
        df.at[top3_indices[1], metric] = str(df[metric].iloc[top3_indices[1]]) + ' 🥈'
        df.at[top3_indices[2], metric] = str(df[metric].iloc[top3_indices[2]]) + ' 🥉'

    return df

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape(y_true, y_pred):
    return 100 * np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true)))

def evaluate_models(models, X, y, k_folds=5):
    final_results = []
    
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

    for model in models:
        results = []
        for train_index, test_index in kf.split(X):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            r2 = round(r2_score(y_test, y_pred),4)
            rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)),4)
            mse = round(mean_squared_error(y_test, y_pred),4)
            mae = mean_absolute_error(y_test, y_pred)
            mape = mean_absolute_percentage_error(y_test, y_pred)
            smape_val = smape(y_test, y_pred)

            results.append([r2, rmse, mse, mae, mape, smape_val])

        # Calculate average scores
        avg_scores = [round(np.mean([result[i] for result in results]), 4) for i in range(len(results[0]))]
        final_results.append([type(model).__name__] + avg_scores)
        print("✔️ {}".format(type(model).__name__))
    columns = ["Model", "R2 Score", "RMSE", "MSE", "MAE", "MAPE", "SMAPE"]
    results_df = pd.DataFrame(final_results, columns=columns)

    # Append ✔️ to the best scores
    results_df = append_medals(results_df)
    
    return results_df

models = [
    # Ensemble Methods
    RandomForestRegressor(n_estimators=30, random_state=0),
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    AdaBoostRegressor(n_estimators=30, random_state=0),
    BaggingRegressor(n_estimators=30, random_state=0),
    ExtraTreesRegressor(n_estimators=30, random_state=0),
    
    # Linear Models
    LinearRegression(),
    Ridge(),
    Lasso(),
    ElasticNet(),
    BayesianRidge(),
    ARDRegression(),
    PassiveAggressiveRegressor(),
    HuberRegressor(),
    TheilSenRegressor(),
    RANSACRegressor(),
    
    # SVM
    SVR(),
    LinearSVR(),
    
    # Neighbors
    KNeighborsRegressor(),
    
    # Neural Network
    MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
    
    # Tree-based
    DecisionTreeRegressor(),
    XGBRegressor(n_estimators=30, random_state=0),
    LGBMRegressor(n_estimators=30, random_state=0),
    
    # Kernel Ridge Regression
    KernelRidge(),
        
    
]

Predicting Total

¶

In [ ]:
y = target_ActualWeightTotal
X_transformed = data_prep_pipeline.fit_transform(X, y)

results_df_total = evaluate_models(models, X_transformed, y)
results_df_total
✔️ RandomForestRegressor
✔️ GradientBoostingRegressor
✔️ AdaBoostRegressor
✔️ BaggingRegressor
✔️ ExtraTreesRegressor
✔️ LinearRegression
✔️ Ridge
✔️ Lasso
✔️ ElasticNet
✔️ BayesianRidge
✔️ ARDRegression
✔️ PassiveAggressiveRegressor
✔️ HuberRegressor
✔️ TheilSenRegressor
✔️ RANSACRegressor
✔️ SVR
✔️ LinearSVR
✔️ KNeighborsRegressor
✔️ MLPRegressor
✔️ DecisionTreeRegressor
✔️ XGBRegressor
✔️ LGBMRegressor
✔️ KernelRidge
Out[ ]:
Model R2 Score RMSE MSE MAE MAPE SMAPE
0 RandomForestRegressor 0.8011 🥈 478.7839 🥈 230298.5194 🥈 290.2632 🥈 1.6181 🥈 1.6119 🥈
1 GradientBoostingRegressor 0.7269 563.3094 317832.8091 404.0855 2.2576 2.2482
2 AdaBoostRegressor 0.5724 705.6288 498249.4218 546.5403 3.0514 3.0409
3 BaggingRegressor 0.8011 🥉 478.8759 🥉 230345.9915 🥉 290.4994 🥉 1.6193 🥉 1.6131 🥉
4 ExtraTreesRegressor 0.7987 481.9655 233258.4131 288.942 🏆 1.6095 🏆 1.6037 🏆
5 LinearRegression 0.6614 627.4376 394391.666 459.6826 2.5602 2.5534
6 Ridge 0.6615 627.4257 394374.8146 459.6643 2.5601 2.5533
7 Lasso 0.6615 627.3773 394296.7737 459.4429 2.5588 2.5521
8 ElasticNet 0.6397 647.966 420133.1125 483.4911 2.6924 2.6869
9 BayesianRidge 0.6615 627.3957 394308.1675 459.5972 2.5597 2.553
10 ARDRegression 0.6616 627.3131 394215.7576 458.5653 2.5535 2.5469
11 PassiveAggressiveRegressor 0.6513 636.9409 406414.8129 459.6518 2.5601 2.5517
12 HuberRegressor 0.6577 630.8401 398798.3915 456.886 2.5433 2.5362
13 TheilSenRegressor -4.0342 2334.9025 5732687.854 927.0029 5.1288 6.4341
14 RANSACRegressor 0.437 797.5331 645369.6614 567.6713 3.1597 3.163
15 SVR 0.1293 1008.7353 1018526.4218 806.9198 4.4698 4.4829
16 LinearSVR -216.1259 15903.7851 252932788.781 15866.4447 88.321 158.2088
17 KNeighborsRegressor 0.757 528.8404 281208.5656 325.7707 1.8184 1.8106
18 MLPRegressor -42.7834 7141.5165 51030007.1392 6199.4957 34.7312 45.495
19 DecisionTreeRegressor 0.7872 495.6468 246499.6868 294.6707 1.6412 1.6361
20 XGBRegressor 0.8059 🏆 473.5516 🏆 225130.7845 🏆 291.2641 1.6239 1.6177
21 LGBMRegressor 0.7844 499.2884 250318.1475 321.993 1.7957 1.7877
22 KernelRidge -283.0564 18186.9662 330774343.0359 18144.7583 101.3038 191.9926

Predicting Front

¶

In [ ]:
y = target_ActualWeightFront
X_transformed = data_prep_pipeline.fit_transform(X, y)

results_df_total = evaluate_models(models, X_transformed, y)
results_df_total
✔️ RandomForestRegressor
✔️ GradientBoostingRegressor
✔️ AdaBoostRegressor
✔️ BaggingRegressor
✔️ ExtraTreesRegressor
✔️ LinearRegression
✔️ Ridge
✔️ Lasso
✔️ ElasticNet
✔️ BayesianRidge
✔️ ARDRegression
✔️ PassiveAggressiveRegressor
✔️ HuberRegressor
✔️ TheilSenRegressor
✔️ RANSACRegressor
✔️ SVR
✔️ LinearSVR
✔️ KNeighborsRegressor
✔️ MLPRegressor
✔️ DecisionTreeRegressor
✔️ XGBRegressor
✔️ LGBMRegressor
✔️ KernelRidge
Out[ ]:
Model R2 Score RMSE MSE MAE MAPE SMAPE
0 RandomForestRegressor 0.8953 🏆 229.5118 🏆 53085.1121 🏆 130.2841 🥉 1.2182 🥈 1.2143 🥈
1 GradientBoostingRegressor 0.856 268.9559 73098.1024 179.3228 1.6784 1.6704
2 AdaBoostRegressor 0.7273 371.5134 138157.1979 272.4418 2.5413 2.5364
3 BaggingRegressor 0.8952 🥈 229.5995 🥈 53129.3674 🥈 130.0752 🏆 1.2162 🏆 1.2124 🏆
4 ExtraTreesRegressor 0.8903 234.6574 55478.9068 130.2439 🥈 1.2183 🥉 1.2145 🥉
5 LinearRegression 0.8335 289.4355 84430.4047 199.5315 1.8616 1.8533
6 Ridge 0.8334 289.4616 84442.4363 199.5924 1.8621 1.8539
7 Lasso 0.8332 289.6671 84553.998 200.0872 1.8667 1.8585
8 ElasticNet 0.8065 312.5823 98204.519 226.3124 2.1056 2.0976
9 BayesianRidge 0.8331 289.7846 84611.4152 199.9467 1.8652 1.857
10 ARDRegression 0.8332 289.6515 84555.3324 199.9103 1.8651 1.8569
11 PassiveAggressiveRegressor 0.8235 297.9253 89457.7342 200.3422 1.8712 1.8609
12 HuberRegressor 0.831 291.5693 85663.8733 196.1196 1.8327 1.8231
13 TheilSenRegressor -4.1268 1576.5451 2562624.5787 537.2081 4.9534 6.559
14 RANSACRegressor 0.6866 393.6625 157507.5776 237.2946 2.2091 2.2059
15 SVR 0.3097 591.3392 349863.4979 441.7173 4.0401 4.0742
16 LinearSVR -148.773 8705.4732 75785957.276 8674.241 80.5311 134.9252
17 KNeighborsRegressor 0.8636 262.4784 69167.6763 156.3438 1.4572 1.452
18 MLPRegressor -20.4212 3289.0527 10837852.3505 2582.4927 23.9279 28.698
19 DecisionTreeRegressor 0.8919 232.9797 54802.9402 130.9895 1.2258 1.2213
20 XGBRegressor 0.8932 🥉 231.5479 🥉 54148.5461 🥉 131.659 1.2321 1.2269
21 LGBMRegressor 0.8814 244.2603 60211.5009 148.7239 1.3882 1.382
22 KernelRidge -234.6551 10917.1252 119188290.7061 10891.1801 101.5747 190.5893

Predicting Back

¶

In [ ]:
y = target_ActualWeightBack
X_transformed = data_prep_pipeline.fit_transform(X, y)

results_df_total = evaluate_models(models, X_transformed, y)
results_df_total
✔️ RandomForestRegressor
✔️ GradientBoostingRegressor
✔️ AdaBoostRegressor
✔️ BaggingRegressor
✔️ ExtraTreesRegressor
✔️ LinearRegression
✔️ Ridge
✔️ Lasso
✔️ ElasticNet
✔️ BayesianRidge
✔️ ARDRegression
✔️ PassiveAggressiveRegressor
✔️ HuberRegressor
✔️ TheilSenRegressor
✔️ RANSACRegressor
✔️ SVR
✔️ LinearSVR
✔️ KNeighborsRegressor
✔️ MLPRegressor
✔️ DecisionTreeRegressor
✔️ XGBRegressor
✔️ LGBMRegressor
✔️ KernelRidge
Out[ ]:
Model R2 Score RMSE MSE MAE MAPE SMAPE
0 RandomForestRegressor 0.7872 🥉 410.7583 🥉 169074.8226 🥉 244.1078 🥉 3.4455 🥉 3.4046 🥉
1 GradientBoostingRegressor 0.6762 507.7242 258141.6621 358.9169 5.0233 4.9609
2 AdaBoostRegressor 0.523 615.0411 379359.7921 462.0963 6.5471 6.4033
3 BaggingRegressor 0.7879 🥈 410.0674 🥈 168522.2289 🥈 243.7857 🥈 3.4402 🥈 3.4002 🥈
4 ExtraTreesRegressor 0.7827 415.1815 172686.9054 244.5569 3.4479 3.4095
5 LinearRegression 0.471 648.7058 421243.5064 505.3203 7.0058 6.9315
6 Ridge 0.4711 648.6939 421226.9838 505.336 7.0059 6.9317
7 Lasso 0.4713 648.5511 421031.8387 505.5588 7.0075 6.9341
8 ElasticNet 0.4419 666.7692 444789.351 523.9646 7.2307 7.1718
9 BayesianRidge 0.4712 648.6101 421097.628 505.7998 7.0098 6.9371
10 ARDRegression 0.4717 648.4037 420813.7174 505.6685 7.0061 6.9347
11 PassiveAggressiveRegressor 0.4531 659.9559 435922.7762 506.6436 7.0153 6.9496
12 HuberRegressor 0.4649 652.5355 426176.4484 505.1507 7.0135 6.9393
13 TheilSenRegressor -0.4491 1054.0084 1143200.2552 664.7454 9.2305 10.3724
14 RANSACRegressor 0.1243 832.885 695611.9439 605.0011 8.4006 8.3753
15 SVR 0.0616 865.4469 749460.4848 667.5756 9.1126 9.2104
16 LinearSVR -32.7431 5184.405 26879375.009 5104.7977 70.5015 109.1993
17 KNeighborsRegressor 0.7301 462.1175 214393.8754 277.6469 3.9305 3.8735
18 MLPRegressor -1.4216 1383.7872 1920399.3398 1065.5495 14.9436 16.3003
19 DecisionTreeRegressor 0.7647 431.59 186756.6538 250.1442 3.5247 3.4848
20 XGBRegressor 0.7918 🏆 406.6225 🏆 165583.6396 🏆 243.4098 🏆 3.4342 🏆 3.3929 🏆
21 LGBMRegressor 0.7686 429.0368 184357.0731 272.2088 3.8217 3.7785
22 KernelRidge -66.1968 7316.1791 53527705.4363 7273.5247 101.8426 188.0728

💡 Findings


We only need to predict the two best outputs, and then we can infer the third.

Although the R2 score is very similar between the Back and Total, the SMAPE score for the Total is much better.

Therefore we will try use the random forest for the front and XGBoost for the total.

👨‍🔬 Feature Engineering

¶

Feature Engineer Front Model

¶

In [ ]:
X = training.drop(columns= targetColumns  + ['TruckSID'])
y = target_ActualWeightFront
X_transformed = data_prep_pipeline.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=0)

🧠 A good idea

Let's get a baseline for how well our model is performing.

In [ ]:
model = XGBRegressor(random_state=42)
scores = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")
print(scores)
print("MAE score: {}".format(np.mean(scores)))
[109.83327439 140.11404666 123.61776288 129.49583773 140.21197477]
MAE score: 128.65457928631758

🧠 A good idea

Let's start by making as many features as we can think of, then then start reducing them until we get the perfect score.

In [ ]:
def FE(Training):
    '''
    # Position Analysis:
    Training['sum_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('sum')
    Training['max_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('max')
    Training['min_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('min')
    Training['std_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('std')

    Training['sum_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('sum')
    Training['max_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('max')
    Training['min_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('min')
    Training['std_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('std')
    Training['avg_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('mean')
    Training['avg_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('mean')

    # Position Analysis (Wheelbase and Overhang):
    Training['sum_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('sum')
    Training['max_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('max')
    Training['min_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('min')
    Training['std_wheelbase_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['WheelBase'].transform('std')

    Training['sum_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('sum')
    Training['max_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('max')
    Training['min_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('min')
    Training['std_overhang_per_front_axle_position'] = Training.groupby('FrontAxlePosition')['Overhang'].transform('std')
    '''
    
    # Interaction Features:
    Training['Engine_Transmission'] = Training['Engine'] * Training['Transmission']
    Training['TransmissionFamily_EngineFamily'] = Training['TransmissionFamily'] * Training['EngineFamily']
    
    # Polynomial Features for numeric variables:
    Training['WheelBase_squared'] = Training['WheelBase'] ** 2
    Training['Overhang_squared'] = Training['Overhang'] ** 2
    
    # Ratio Features:
    Training['Front_to_Rear_Wheels'] = Training['FrontWheels'] / (Training['RearWheels'] + 0.001)  # Add a small number to avoid division by zero
    Training['WheelBase_to_Overhang'] = Training['WheelBase'] / (Training['Overhang'] + 0.001)

    # Aggregated Features for TransmissionFamily and EngineFamily:
    Training['avg_WheelBase_per_TransmissionFamily'] = Training.groupby('TransmissionFamily')['WheelBase'].transform('mean')
    Training['avg_Overhang_per_EngineFamily'] = Training.groupby('EngineFamily')['Overhang'].transform('mean')
    
    # Features based on other columns:
    Training['sum_WheelBase_per_Engine'] = Training.groupby('Engine')['WheelBase'].transform('sum')
    Training['max_Overhang_per_Transmission'] = Training.groupby('Transmission')['Overhang'].transform('max')
    
    #Training['Transmission_EngineFamily'] = Training['Transmission'] * Training['EngineFamily']
    
    '''
    # Standard deviation relative to the mean (Coefficient of Variation):
    Training['cv_WheelBase_per_FrontAxlePosition'] = Training['std_wheelbase_per_front_axle_position'] / (Training['avg_wheelbase_per_front_axle_position'] + 0.001)
    Training['cv_Overhang_per_FrontAxlePosition'] = Training['std_overhang_per_front_axle_position'] / (Training['avg_overhang_per_front_axle_position'] + 0.001)
    
    #Interactions with Top Features:
    Training['Engine_TransmissionFamily'] = Training['Engine'] * Training['TransmissionFamily']
    
    
    #Aggregations with Other Features:
    Training['mean_WheelBase_per_Transmission'] = Training.groupby('Transmission')['WheelBase'].transform('mean')
    Training['mean_Overhang_per_EngineFamily'] = Training.groupby('EngineFamily')['Overhang'].transform('mean')
    Training['sum_FrontWheels_per_Transmission'] = Training.groupby('Transmission')['FrontWheels'].transform('sum')
    
    #Cumulative Sum and Diff Features:
    # Assuming some sort of order (like time), adjust as necessary
    Training['cumsum_WheelBase'] = Training['WheelBase'].cumsum()
    Training['cumsum_Overhang'] = Training['Overhang'].cumsum()
    Training['diff_WheelBase'] = Training['WheelBase'].diff()
    Training['diff_Overhang'] = Training['Overhang'].diff()
    
    #Bin-based Features:
    Training['WheelBase_bins'] = pd.cut(Training['WheelBase'], bins=5, labels=False)  # 5 bins, can adjust
    Training['mean_Overhang_per_WheelBase_bins'] = Training.groupby('WheelBase_bins')['Overhang'].transform('mean')
    
    '''
    
    return Training
In [ ]:
X_train_fe = FE(X_train)
X_test_fe = FE(X_test)

🧠 A good idea

All the features where graphed to help find the most useful ones.

In [ ]:
model = XGBRegressor(random_state=42) 
model.fit(X_train_fe, y_train)
scores = -cross_val_score(model, X_train_fe, y_train, cv=5, scoring="neg_mean_absolute_error")
print(scores)
print("Mean:", scores.mean())
xgb.plot_importance(model)
plt.show()
[109.93013091 138.35848818 126.57804054 129.82614548 140.92304951]
Mean: 129.12317092483107
In [ ]:
feature_importances = model.feature_importances_

# Create a DataFrame for better manipulation
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})
importance_df_sorted =  importance_df.sort_values(by='Importance', ascending=False)
importance_df_sorted.to_csv('importance_df_sorted.csv', index=False)
importance_df_sorted
Out[ ]:
Feature Importance
18 TransmissionFamily 0.368260
1 Transmission 0.233128
17 EngineFamily 0.129587
0 Engine 0.105301
16 TagAxle 0.042307
14 FrontWheels 0.041407
6 Liner 0.019834
24 WheelBase_to_Overhang 0.007705
4 Overhang 0.006767
3 WheelBase 0.006328
8 Cab 0.004250
28 max_Overhang_per_Transmission 0.004218
26 avg_Overhang_per_EngineFamily 0.003950
21 WheelBase_squared 0.003193
12 RearWheels 0.002884
5 FrameRails 0.002641
10 RearSusp 0.002446
19 Engine_Transmission 0.002134
23 Front_to_Rear_Wheels 0.002083
22 Overhang_squared 0.002062
13 RearTires 0.001861
20 TransmissionFamily_EngineFamily 0.001588
11 FrontSusp 0.001357
9 RearAxels 0.001317
15 FrontTires 0.001300
27 sum_WheelBase_per_Engine 0.000877
2 FrontAxlePosition 0.000817
7 FrontEndExt 0.000398
25 avg_WheelBase_per_TransmissionFamily 0.000000
In [ ]:
model = xgb.train({"learning_rate": 0.1}, xgb.DMatrix(X_train_fe, label=y_train), 100)

# Specifically for XGBoost, you can use the TreeExplainer which works best for tree-based models
explainer = shap.TreeExplainer(model)

# Compute SHAP values for a sample of the test set
shap_values = explainer.shap_values(X_test_fe)

# Visualize the first prediction's explanation
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], X_test_fe.iloc[0])

shap.summary_plot(shap_values, X_test_fe)
# For a dependence plot for a specific feature 
for i in range(5):
    feature_name = X_test.columns[i]  # get the name of the first feature
    shap.dependence_plot(feature_name, shap_values, X_test_fe)

🧠 A good idea

Features where filtered out based on an arbitrary number I fiddled with until I got the best results

In [ ]:
# Filter out features with importance less importance
#0.001927
selected_features = importance_df[importance_df['Importance'] >= 0.001927]['Feature'].tolist()

# Subset the dataset
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
In [ ]:
model = XGBRegressor(random_state=42)  # or xgb.XGBClassifier() for classification
scores = -cross_val_score(model, X_train_selected, y_train, cv=5, scoring="neg_mean_absolute_error")

print(scores)
print("Mean:", scores.mean())
[108.81307802 141.39698849 126.0712178  130.55963894 137.75222762]
Mean: 128.9186301731419

💡 Findings

We see a decent difference with the new and improved features.

In [ ]:
X_train_selected.to_csv('X_train_selected.csv', index=False)
front_selected_columns = X_train_selected.columns
front_selected_columns
Out[ ]:
Index(['Engine', 'Transmission', 'WheelBase', 'Overhang', 'FrameRails',
       'Liner', 'Cab', 'RearSusp', 'RearWheels', 'FrontWheels', 'TagAxle',
       'EngineFamily', 'TransmissionFamily', 'Engine_Transmission',
       'WheelBase_squared', 'Overhang_squared', 'Front_to_Rear_Wheels',
       'WheelBase_to_Overhang', 'avg_Overhang_per_EngineFamily',
       'max_Overhang_per_Transmission'],
      dtype='object')

💡 Findings

These were deemed the important features

In [ ]:
X_transformed_fe = FE(X_transformed)
X_transformed_fe = X_transformed_fe[front_selected_columns]
X_transformed_fe
Out[ ]:
Engine Transmission WheelBase Overhang FrameRails Liner Cab RearSusp RearWheels FrontWheels TagAxle EngineFamily TransmissionFamily Engine_Transmission WheelBase_squared Overhang_squared Front_to_Rear_Wheels WheelBase_to_Overhang avg_Overhang_per_EngineFamily max_Overhang_per_Transmission
0 -0.775150 1.199613 2.620206 0.867729 0.647233 1.118745 -0.946623 0.251651 1.262227 1.224191 -0.15539 -0.774318 1.059240 -0.929880 6.865481 0.752953 0.969098 3.016139 -0.089933 2.373981
1 -0.775150 -0.917003 -1.545882 -1.391651 0.647233 1.118745 1.193472 0.251651 -0.317981 0.801290 -0.15539 -0.774318 -0.944073 0.710815 2.389750 1.936692 -2.527878 1.111625 -0.089933 1.244292
2 0.194674 -0.917003 0.537162 -1.391651 0.647233 1.118745 -0.408809 0.251651 1.262227 1.224191 -0.15539 0.283715 -0.944073 -0.178516 0.288543 1.936692 0.969098 -0.386267 -0.073598 1.244292
3 0.270655 1.199613 0.726530 0.867729 0.647233 1.118745 -0.946623 0.251651 1.262227 1.224191 -0.15539 0.283715 1.059240 0.324682 0.527846 0.752953 0.969098 0.836314 -0.073598 2.373981
4 1.580971 1.199613 1.484000 0.867729 0.647233 1.118745 -0.408809 0.251651 -0.317981 -0.856580 -0.15539 1.551007 1.059240 1.896553 2.202257 0.752953 2.702304 1.708244 0.409488 2.373981
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 -0.175652 -0.231394 0.158427 0.867729 0.647233 -0.906376 -0.946623 -3.511279 1.691276 1.224191 -0.15539 -0.774318 -0.944073 0.040645 0.025099 0.752953 0.723399 0.182367 -0.089933 0.867729
2640 0.270655 1.199613 0.158427 -1.015088 0.647233 1.118745 2.089485 0.251651 1.262227 1.224191 -0.15539 0.283715 1.059240 0.324682 0.025099 1.030403 0.969098 -0.156226 -0.073598 2.373981
2641 0.270655 1.199613 0.915898 -0.638524 0.647233 -0.906376 -0.946623 0.517296 1.262227 1.224191 -0.15539 0.283715 1.059240 0.324682 0.838868 0.407713 0.969098 -1.436647 -0.073598 2.373981
2642 -0.775150 -0.917003 0.915898 -2.144777 0.647233 1.118745 -0.408809 0.251651 -0.721544 -1.037440 -0.15539 -0.774318 -0.944073 0.710815 0.838868 4.600069 1.439802 -0.427235 -0.089933 1.244292
2643 -0.775150 -0.917003 -0.599043 0.867729 0.647233 -0.906376 -0.946623 0.251651 -0.317981 -0.856580 -0.15539 -0.774318 -0.944073 0.710815 0.358853 0.752953 2.702304 -0.689563 -0.089933 1.244292

2644 rows × 20 columns

🧠 A good idea

Let's being optimizing the hyperparameters.
We will use optuna to make this process really easy and reliable.

In [ ]:
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X = X_transformed_fe
y = target_ActualWeightFront

def objective(trial):
    # Define the hyperparameter search space
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 1000),
        "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 1, 10),
        "reg_lambda": trial.suggest_loguniform("reg_lambda", 1e-9, 100),
        "subsample": trial.suggest_float("subsample", 0.1, 1),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.1, 1),
        "gamma": trial.suggest_float("gamma", 0, 1),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
    }

    # Split the data into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the XGBRegressor with the suggested hyperparameters
    model = XGBRegressor(**params, eval_metric='mae',random_state=42,n_jobs=1)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=50, verbose=False)

    # Calculate the MAE on the validation set
    y_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)

    return mae

# Create a study object and specify that the goal is to minimize the objective function
study = optuna.create_study(direction="minimize")  # We want to minimize the MAE
study.optimize(objective, n_trials=100)

# Get the best hyperparameters and their corresponding MAE
best_params = study.best_params
best_score = study.best_value

print("Best Hyperparameters:", best_params)
print("Best Score (MAE):", best_score)
[I 2023-08-24 21:37:15,634] A new study created in memory with name: no-name-7e8eefa7-d481-4023-86e8-3118d3736533
[I 2023-08-24 21:37:16,635] Trial 0 finished with value: 132.4071453872796 and parameters: {'n_estimators': 778, 'learning_rate': 0.04566242473875747, 'max_depth': 3, 'reg_lambda': 0.5835957689324223, 'subsample': 0.8579024686544837, 'colsample_bytree': 0.6518396226535519, 'gamma': 0.4926660807691057, 'min_child_weight': 7}. Best is trial 0 with value: 132.4071453872796.
[I 2023-08-24 21:37:17,127] Trial 1 finished with value: 125.43281643576826 and parameters: {'n_estimators': 515, 'learning_rate': 0.23491467286748102, 'max_depth': 4, 'reg_lambda': 0.16328659509960294, 'subsample': 0.8753910226197854, 'colsample_bytree': 0.7038019461183781, 'gamma': 0.4183235191057644, 'min_child_weight': 9}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:17,558] Trial 2 finished with value: 135.5872732013539 and parameters: {'n_estimators': 178, 'learning_rate': 0.03834613358805908, 'max_depth': 10, 'reg_lambda': 1.4484118111965656, 'subsample': 0.7816268862422202, 'colsample_bytree': 0.40808698044572067, 'gamma': 0.39963155713820864, 'min_child_weight': 7}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:18,483] Trial 3 finished with value: 132.70857972882555 and parameters: {'n_estimators': 884, 'learning_rate': 0.04835611501837917, 'max_depth': 5, 'reg_lambda': 4.062813961076248, 'subsample': 0.18725584879759472, 'colsample_bytree': 0.6563565441068906, 'gamma': 0.9047692372971008, 'min_child_weight': 2}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:18,533] Trial 4 finished with value: 2113.9746930100755 and parameters: {'n_estimators': 41, 'learning_rate': 0.039154814051259, 'max_depth': 1, 'reg_lambda': 2.3951979690535525, 'subsample': 0.4495753804042988, 'colsample_bytree': 0.787101520408692, 'gamma': 0.11595299260917791, 'min_child_weight': 3}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:18,827] Trial 5 finished with value: 142.09308829896096 and parameters: {'n_estimators': 893, 'learning_rate': 0.2214655779293194, 'max_depth': 2, 'reg_lambda': 1.8893190620594973, 'subsample': 0.4157732545791466, 'colsample_bytree': 0.2636768600669744, 'gamma': 0.7603688601700967, 'min_child_weight': 9}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:19,873] Trial 6 finished with value: 140.8227526763224 and parameters: {'n_estimators': 676, 'learning_rate': 0.021881471821074625, 'max_depth': 10, 'reg_lambda': 2.3422047333344067e-07, 'subsample': 0.2339219371634801, 'colsample_bytree': 0.18800371757749024, 'gamma': 0.21128796141900152, 'min_child_weight': 9}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:21,125] Trial 7 finished with value: 126.54438316671914 and parameters: {'n_estimators': 879, 'learning_rate': 0.0389904280064983, 'max_depth': 5, 'reg_lambda': 1.4567835664186264e-06, 'subsample': 0.827862799711435, 'colsample_bytree': 0.8298477528098921, 'gamma': 0.5016726585159476, 'min_child_weight': 5}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:23,140] Trial 8 finished with value: 127.91857633422543 and parameters: {'n_estimators': 723, 'learning_rate': 0.02545933982843321, 'max_depth': 7, 'reg_lambda': 0.01613257226273038, 'subsample': 0.6996558998095475, 'colsample_bytree': 0.8148251539719638, 'gamma': 0.5725206576316784, 'min_child_weight': 9}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:23,165] Trial 9 finished with value: 8723.411002132694 and parameters: {'n_estimators': 20, 'learning_rate': 0.01060863042706886, 'max_depth': 2, 'reg_lambda': 9.821450325011904, 'subsample': 0.23024556388926223, 'colsample_bytree': 0.49175589952144083, 'gamma': 0.9607273045095055, 'min_child_weight': 6}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:23,571] Trial 10 finished with value: 127.22424531643577 and parameters: {'n_estimators': 436, 'learning_rate': 0.26027780415517815, 'max_depth': 7, 'reg_lambda': 0.0018701442205215858, 'subsample': 0.9610890609392372, 'colsample_bytree': 0.9542696645437158, 'gamma': 0.019810948147209073, 'min_child_weight': 10}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:24,055] Trial 11 finished with value: 127.62992585996537 and parameters: {'n_estimators': 443, 'learning_rate': 0.10857823212925481, 'max_depth': 5, 'reg_lambda': 7.660758884408461e-07, 'subsample': 0.9725624124669809, 'colsample_bytree': 0.9903055231531217, 'gamma': 0.31326860236484316, 'min_child_weight': 4}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:24,913] Trial 12 finished with value: 127.99051233863351 and parameters: {'n_estimators': 541, 'learning_rate': 0.09727830958598613, 'max_depth': 4, 'reg_lambda': 3.5129132131749995e-05, 'subsample': 0.6665608044558613, 'colsample_bytree': 0.7731189013828259, 'gamma': 0.599888577286736, 'min_child_weight': 5}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:25,405] Trial 13 finished with value: 126.03880052542506 and parameters: {'n_estimators': 986, 'learning_rate': 0.09182526748345886, 'max_depth': 7, 'reg_lambda': 3.2495765048162434e-09, 'subsample': 0.8326357233970464, 'colsample_bytree': 0.6189429741728897, 'gamma': 0.357170790031503, 'min_child_weight': 1}. Best is trial 1 with value: 125.43281643576826.
[I 2023-08-24 21:37:25,901] Trial 14 finished with value: 124.94192773929471 and parameters: {'n_estimators': 997, 'learning_rate': 0.15130810686401724, 'max_depth': 8, 'reg_lambda': 1.228545322169782e-09, 'subsample': 0.6468054785776172, 'colsample_bytree': 0.5768366345064918, 'gamma': 0.31555877415318534, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:26,548] Trial 15 finished with value: 127.12612169395466 and parameters: {'n_estimators': 335, 'learning_rate': 0.17342552513692705, 'max_depth': 8, 'reg_lambda': 0.03275982325966444, 'subsample': 0.6021999568376306, 'colsample_bytree': 0.49886572533716333, 'gamma': 0.21491614902941425, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:27,040] Trial 16 finished with value: 129.34766854927582 and parameters: {'n_estimators': 551, 'learning_rate': 0.17234331057086902, 'max_depth': 9, 'reg_lambda': 3.041746635728718e-09, 'subsample': 0.559509304164145, 'colsample_bytree': 0.386863594629731, 'gamma': 0.27593218996539043, 'min_child_weight': 7}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:27,661] Trial 17 finished with value: 129.30547219379724 and parameters: {'n_estimators': 291, 'learning_rate': 0.2689946404503632, 'max_depth': 6, 'reg_lambda': 54.382798309821496, 'subsample': 0.7310685750898838, 'colsample_bytree': 0.5810311633446974, 'gamma': 0.4065854106345552, 'min_child_weight': 3}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:28,108] Trial 18 finished with value: 131.40859301204344 and parameters: {'n_estimators': 626, 'learning_rate': 0.2963397681364949, 'max_depth': 8, 'reg_lambda': 0.00033949064294032166, 'subsample': 0.6473234385427341, 'colsample_bytree': 0.6981658806409754, 'gamma': 0.1482523193285713, 'min_child_weight': 8}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:28,471] Trial 19 finished with value: 127.29369686712846 and parameters: {'n_estimators': 227, 'learning_rate': 0.14522763766714922, 'max_depth': 4, 'reg_lambda': 1.3505736702640705e-05, 'subsample': 0.9098958042345393, 'colsample_bytree': 0.7055939076358009, 'gamma': 0.6894163570889831, 'min_child_weight': 10}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:29,223] Trial 20 finished with value: 126.28532352015114 and parameters: {'n_estimators': 984, 'learning_rate': 0.07413785519030056, 'max_depth': 6, 'reg_lambda': 7.193334213603787e-08, 'subsample': 0.7463209822933622, 'colsample_bytree': 0.5611324505402174, 'gamma': 0.41173879404823976, 'min_child_weight': 4}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:29,646] Trial 21 finished with value: 127.75240942813288 and parameters: {'n_estimators': 952, 'learning_rate': 0.11994324642256102, 'max_depth': 7, 'reg_lambda': 1.1767466018724594e-09, 'subsample': 0.861606838355318, 'colsample_bytree': 0.6001269936570764, 'gamma': 0.3362459109779751, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:29,952] Trial 22 finished with value: 127.93821704778023 and parameters: {'n_estimators': 789, 'learning_rate': 0.19899958278162685, 'max_depth': 8, 'reg_lambda': 8.078922061229748e-09, 'subsample': 0.793561168667525, 'colsample_bytree': 0.499708058772754, 'gamma': 0.3408668097590281, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:30,696] Trial 23 finished with value: 127.8188341270466 and parameters: {'n_estimators': 996, 'learning_rate': 0.13779323967681825, 'max_depth': 9, 'reg_lambda': 1.644571347336594e-08, 'subsample': 0.9113612925586896, 'colsample_bytree': 0.6031829309851063, 'gamma': 0.2644199170082564, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:31,399] Trial 24 finished with value: 127.17609857131612 and parameters: {'n_estimators': 850, 'learning_rate': 0.07905879059679004, 'max_depth': 4, 'reg_lambda': 3.279426984842617e-08, 'subsample': 0.7856913006335435, 'colsample_bytree': 0.7281678496908335, 'gamma': 0.45504826807035675, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:31,775] Trial 25 finished with value: 127.74823013420969 and parameters: {'n_estimators': 639, 'learning_rate': 0.15584053836129788, 'max_depth': 6, 'reg_lambda': 1.0324808412445763e-09, 'subsample': 0.9802292936363333, 'colsample_bytree': 0.8828570295216268, 'gamma': 0.34596658642839867, 'min_child_weight': 3}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:32,408] Trial 26 finished with value: 129.31180878069898 and parameters: {'n_estimators': 786, 'learning_rate': 0.20035718587280305, 'max_depth': 9, 'reg_lambda': 8.977490603162365e-09, 'subsample': 0.6938292297428492, 'colsample_bytree': 0.7409471402421565, 'gamma': 0.4764061710433398, 'min_child_weight': 4}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:33,153] Trial 27 finished with value: 126.67401015428212 and parameters: {'n_estimators': 936, 'learning_rate': 0.12282767501032768, 'max_depth': 7, 'reg_lambda': 1.3091710090271623e-07, 'subsample': 0.8583142139258471, 'colsample_bytree': 0.6587728722515241, 'gamma': 0.5541906188173018, 'min_child_weight': 6}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:33,805] Trial 28 finished with value: 132.06898048842885 and parameters: {'n_estimators': 471, 'learning_rate': 0.0915818160375384, 'max_depth': 3, 'reg_lambda': 3.0240136734926108e-06, 'subsample': 0.7615495721964808, 'colsample_bytree': 0.8759651213263617, 'gamma': 0.41750021194833165, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:34,551] Trial 29 finished with value: 132.10209505864296 and parameters: {'n_estimators': 810, 'learning_rate': 0.06365378018955183, 'max_depth': 3, 'reg_lambda': 3.075498707965548e-08, 'subsample': 0.8614788132675466, 'colsample_bytree': 0.6550870329399731, 'gamma': 0.4879184030426139, 'min_child_weight': 8}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:35,079] Trial 30 finished with value: 125.50037512791246 and parameters: {'n_estimators': 726, 'learning_rate': 0.2342678663443616, 'max_depth': 8, 'reg_lambda': 0.11069152142306457, 'subsample': 0.6364723456460079, 'colsample_bytree': 0.6202235579891875, 'gamma': 0.3648502758139992, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:35,442] Trial 31 finished with value: 125.97189123504408 and parameters: {'n_estimators': 680, 'learning_rate': 0.21092503617278432, 'max_depth': 8, 'reg_lambda': 0.054406498286065044, 'subsample': 0.6339828884054357, 'colsample_bytree': 0.6211182926778845, 'gamma': 0.3699007721085756, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:35,988] Trial 32 finished with value: 126.10003984965365 and parameters: {'n_estimators': 731, 'learning_rate': 0.22556085420214375, 'max_depth': 10, 'reg_lambda': 0.22075025181980001, 'subsample': 0.6422659200143813, 'colsample_bytree': 0.5473009901649993, 'gamma': 0.3801001597057148, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:36,451] Trial 33 finished with value: 131.49591787035578 and parameters: {'n_estimators': 562, 'learning_rate': 0.29616535915168607, 'max_depth': 8, 'reg_lambda': 0.16547199912174967, 'subsample': 0.5904974425459516, 'colsample_bytree': 0.6755489682556939, 'gamma': 0.27713272176225157, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:36,840] Trial 34 finished with value: 126.23389286838791 and parameters: {'n_estimators': 707, 'learning_rate': 0.2302517191076446, 'max_depth': 9, 'reg_lambda': 0.0070570853182569835, 'subsample': 0.5169166182926936, 'colsample_bytree': 0.7369361951376336, 'gamma': 0.41569010064862805, 'min_child_weight': 3}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:37,338] Trial 35 finished with value: 126.67065614176637 and parameters: {'n_estimators': 616, 'learning_rate': 0.1700895480673002, 'max_depth': 8, 'reg_lambda': 0.5184641086504166, 'subsample': 0.7145590172052676, 'colsample_bytree': 0.6324981108174565, 'gamma': 0.5184779507402361, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:37,864] Trial 36 finished with value: 126.63194048134446 and parameters: {'n_estimators': 360, 'learning_rate': 0.19865963591983107, 'max_depth': 10, 'reg_lambda': 0.08694538127564799, 'subsample': 0.5111828015207219, 'colsample_bytree': 0.6852301917549287, 'gamma': 0.45259755387297074, 'min_child_weight': 8}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:38,226] Trial 37 finished with value: 127.72103520544711 and parameters: {'n_estimators': 497, 'learning_rate': 0.24869046937369327, 'max_depth': 6, 'reg_lambda': 0.8154092149790162, 'subsample': 0.6396487676063962, 'colsample_bytree': 0.6366047365057361, 'gamma': 0.3037704019047905, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:38,520] Trial 38 finished with value: 133.78171368269835 and parameters: {'n_estimators': 665, 'learning_rate': 0.139585866366377, 'max_depth': 5, 'reg_lambda': 0.055487274882132925, 'subsample': 0.41848757937267045, 'colsample_bytree': 0.5570471108610044, 'gamma': 0.22306362291912624, 'min_child_weight': 2}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:38,643] Trial 39 finished with value: 171.98024367325252 and parameters: {'n_estimators': 119, 'learning_rate': 0.20142547618458867, 'max_depth': 1, 'reg_lambda': 0.003686998021917772, 'subsample': 0.68993710660369, 'colsample_bytree': 0.436511353459194, 'gamma': 0.37579339831295067, 'min_child_weight': 3}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:39,162] Trial 40 finished with value: 126.42065737169395 and parameters: {'n_estimators': 383, 'learning_rate': 0.25047125589255326, 'max_depth': 9, 'reg_lambda': 0.0011585704890211242, 'subsample': 0.4728127379765559, 'colsample_bytree': 0.7482256313362939, 'gamma': 0.15458495453162724, 'min_child_weight': 7}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:39,611] Trial 41 finished with value: 126.68262702692066 and parameters: {'n_estimators': 903, 'learning_rate': 0.1691805003250567, 'max_depth': 7, 'reg_lambda': 0.026394462451432413, 'subsample': 0.7684966825687314, 'colsample_bytree': 0.6127117427065627, 'gamma': 0.36438609047202186, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:40,133] Trial 42 finished with value: 125.60597548016372 and parameters: {'n_estimators': 857, 'learning_rate': 0.12546314852907015, 'max_depth': 8, 'reg_lambda': 0.007183940573932222, 'subsample': 0.6065792848746795, 'colsample_bytree': 0.6391509764193508, 'gamma': 0.3080311787931148, 'min_child_weight': 1}. Best is trial 14 with value: 124.94192773929471.
[I 2023-08-24 21:37:40,681] Trial 43 finished with value: 124.78965655502203 and parameters: {'n_estimators': 858, 'learning_rate': 0.1301914704241828, 'max_depth': 8, 'reg_lambda': 0.010074944155120275, 'subsample': 0.599725649219366, 'colsample_bytree': 0.6956601219005432, 'gamma': 0.30697906029172484, 'min_child_weight': 1}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:41,185] Trial 44 finished with value: 138.90886236618388 and parameters: {'n_estimators': 840, 'learning_rate': 0.1194690794989917, 'max_depth': 2, 'reg_lambda': 0.010796484962367237, 'subsample': 0.5793816890340173, 'colsample_bytree': 0.7992601916648998, 'gamma': 0.3099112793794747, 'min_child_weight': 2}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:41,888] Trial 45 finished with value: 126.189182540932 and parameters: {'n_estimators': 909, 'learning_rate': 0.14092622528316182, 'max_depth': 10, 'reg_lambda': 0.0007643143265683949, 'subsample': 0.5360575920729431, 'colsample_bytree': 0.6860112521164702, 'gamma': 0.2467370808611326, 'min_child_weight': 1}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:42,461] Trial 46 finished with value: 125.23514616459383 and parameters: {'n_estimators': 748, 'learning_rate': 0.11108075777632465, 'max_depth': 8, 'reg_lambda': 0.004364429749935437, 'subsample': 0.6164735505650625, 'colsample_bytree': 0.7741222303427062, 'gamma': 0.18586918457369517, 'min_child_weight': 3}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:43,039] Trial 47 finished with value: 126.91608696079975 and parameters: {'n_estimators': 743, 'learning_rate': 0.10800585707732638, 'max_depth': 9, 'reg_lambda': 0.0001303685356377775, 'subsample': 0.6727644026139147, 'colsample_bytree': 0.784000545401681, 'gamma': 0.1904953489357061, 'min_child_weight': 3}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:43,730] Trial 48 finished with value: 129.25931424157744 and parameters: {'n_estimators': 819, 'learning_rate': 0.15967334567727418, 'max_depth': 4, 'reg_lambda': 0.0022539182331560996, 'subsample': 0.5569223305818481, 'colsample_bytree': 0.8404318302431052, 'gamma': 0.11026095283087317, 'min_child_weight': 5}. Best is trial 43 with value: 124.78965655502203.
[I 2023-08-24 21:37:44,170] Trial 49 finished with value: 124.47597090483312 and parameters: {'n_estimators': 754, 'learning_rate': 0.19021426230169275, 'max_depth': 7, 'reg_lambda': 3.306727865458625, 'subsample': 0.7209932624237528, 'colsample_bytree': 0.7638264064206766, 'gamma': 0.23532758520441605, 'min_child_weight': 4}. Best is trial 49 with value: 124.47597090483312.
[I 2023-08-24 21:37:44,671] Trial 50 finished with value: 124.37741557777078 and parameters: {'n_estimators': 753, 'learning_rate': 0.18730972755533032, 'max_depth': 7, 'reg_lambda': 4.150263201191068, 'subsample': 0.7214442686673199, 'colsample_bytree': 0.7713579237492315, 'gamma': 0.24558311289405715, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:45,138] Trial 51 finished with value: 125.7098121162626 and parameters: {'n_estimators': 581, 'learning_rate': 0.18059223601671676, 'max_depth': 7, 'reg_lambda': 4.355119520810934, 'subsample': 0.7231707713762874, 'colsample_bytree': 0.7652425178862647, 'gamma': 0.23725647139373546, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:45,615] Trial 52 finished with value: 125.2506518616184 and parameters: {'n_estimators': 749, 'learning_rate': 0.1516014963700012, 'max_depth': 7, 'reg_lambda': 1.7466315341996252, 'subsample': 0.6941528453805853, 'colsample_bytree': 0.7044794420002095, 'gamma': 0.18897408089386372, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:46,527] Trial 53 finished with value: 128.27068861185452 and parameters: {'n_estimators': 762, 'learning_rate': 0.14343091511975034, 'max_depth': 7, 'reg_lambda': 20.38590846452544, 'subsample': 0.6760926482881967, 'colsample_bytree': 0.8067087470731451, 'gamma': 0.18661634250468895, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:47,592] Trial 54 finished with value: 125.74003758658691 and parameters: {'n_estimators': 939, 'learning_rate': 0.10562626179784838, 'max_depth': 6, 'reg_lambda': 1.8440324083592354, 'subsample': 0.7200035934992797, 'colsample_bytree': 0.7147128741111464, 'gamma': 0.08953477362904216, 'min_child_weight': 6}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:48,429] Trial 55 finished with value: 127.71839578085643 and parameters: {'n_estimators': 881, 'learning_rate': 0.1581042037181086, 'max_depth': 7, 'reg_lambda': 9.333457239154471, 'subsample': 0.6110040931006008, 'colsample_bytree': 0.7770699220228973, 'gamma': 0.17839775215303752, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:48,914] Trial 56 finished with value: 127.16540066120906 and parameters: {'n_estimators': 771, 'learning_rate': 0.18388296619301872, 'max_depth': 8, 'reg_lambda': 0.3457323276553971, 'subsample': 0.7441842535381664, 'colsample_bytree': 0.7191697888375954, 'gamma': 0.2580112959152969, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:49,321] Trial 57 finished with value: 126.8254523673646 and parameters: {'n_estimators': 681, 'learning_rate': 0.1303480518644968, 'max_depth': 6, 'reg_lambda': 0.9472957449457589, 'subsample': 0.7983635594384347, 'colsample_bytree': 0.7505270547229714, 'gamma': 0.2143205517612441, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:49,810] Trial 58 finished with value: 126.38611362562972 and parameters: {'n_estimators': 808, 'learning_rate': 0.15257711217951989, 'max_depth': 7, 'reg_lambda': 2.4979307112482645, 'subsample': 0.695924168719525, 'colsample_bytree': 0.8476518743291745, 'gamma': 0.08044309520787929, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:51,287] Trial 59 finished with value: 127.36896228550063 and parameters: {'n_estimators': 852, 'learning_rate': 0.09836111373583013, 'max_depth': 8, 'reg_lambda': 31.801614749584456, 'subsample': 0.6609952539696674, 'colsample_bytree': 0.7055730369250286, 'gamma': 0.15175481937518404, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:51,882] Trial 60 finished with value: 125.48887284516688 and parameters: {'n_estimators': 908, 'learning_rate': 0.11730244814466667, 'max_depth': 9, 'reg_lambda': 0.2757846883608126, 'subsample': 0.8186247628843653, 'colsample_bytree': 0.8110855190652146, 'gamma': 0.27273079370021114, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:52,457] Trial 61 finished with value: 127.01004850834383 and parameters: {'n_estimators': 700, 'learning_rate': 0.18380038653838035, 'max_depth': 5, 'reg_lambda': 5.17592419859341, 'subsample': 0.764385571073037, 'colsample_bytree': 0.6809648116989024, 'gamma': 0.2171710509595398, 'min_child_weight': 6}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:52,949] Trial 62 finished with value: 125.37953351306675 and parameters: {'n_estimators': 638, 'learning_rate': 0.22067644788605653, 'max_depth': 7, 'reg_lambda': 1.3011722141814206, 'subsample': 0.7291339124454443, 'colsample_bytree': 0.7721187572483466, 'gamma': 0.2909904015976836, 'min_child_weight': 10}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:53,531] Trial 63 finished with value: 125.56025292230794 and parameters: {'n_estimators': 640, 'learning_rate': 0.21185808436499187, 'max_depth': 7, 'reg_lambda': 0.9897291127546658, 'subsample': 0.738579757973617, 'colsample_bytree': 0.7699563022916769, 'gamma': 0.2812343348320005, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:54,118] Trial 64 finished with value: 126.72391323598866 and parameters: {'n_estimators': 602, 'learning_rate': 0.15411394152311392, 'max_depth': 7, 'reg_lambda': 0.01964942567411227, 'subsample': 0.6939386996960284, 'colsample_bytree': 0.7320781399898583, 'gamma': 0.24525916698437117, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:55,744] Trial 65 finished with value: 128.96415991026447 and parameters: {'n_estimators': 757, 'learning_rate': 0.1292091522125983, 'max_depth': 6, 'reg_lambda': 85.98770870798518, 'subsample': 0.6161501958383468, 'colsample_bytree': 0.8207925484476177, 'gamma': 0.3294319485352685, 'min_child_weight': 10}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:56,301] Trial 66 finished with value: 127.09116100244017 and parameters: {'n_estimators': 971, 'learning_rate': 0.268791896141803, 'max_depth': 8, 'reg_lambda': 10.233536155265499, 'subsample': 0.5855221807340311, 'colsample_bytree': 0.6659513832845251, 'gamma': 0.1877866488353676, 'min_child_weight': 6}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:56,860] Trial 67 finished with value: 126.8051622520466 and parameters: {'n_estimators': 792, 'learning_rate': 0.18962802936767245, 'max_depth': 7, 'reg_lambda': 2.136956352148615, 'subsample': 0.650547553623773, 'colsample_bytree': 0.8691035766968243, 'gamma': 0.12570509539226, 'min_child_weight': 9}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:57,453] Trial 68 finished with value: 125.3771425338476 and parameters: {'n_estimators': 829, 'learning_rate': 0.21625555645142713, 'max_depth': 8, 'reg_lambda': 0.31641146123709396, 'subsample': 0.6758175051437016, 'colsample_bytree': 0.7861367236113043, 'gamma': 0.314508456530008, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:58,103] Trial 69 finished with value: 126.76625103313917 and parameters: {'n_estimators': 827, 'learning_rate': 0.17260877195104368, 'max_depth': 9, 'reg_lambda': 0.1436988064416842, 'subsample': 0.6717444643552086, 'colsample_bytree': 0.9190047351584717, 'gamma': 0.31724172134638773, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:58,613] Trial 70 finished with value: 126.43052631061083 and parameters: {'n_estimators': 878, 'learning_rate': 0.13534441383641313, 'max_depth': 8, 'reg_lambda': 0.020229149831253634, 'subsample': 0.6182999944288874, 'colsample_bytree': 0.7129336029923017, 'gamma': 0.24703668993557445, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:59,274] Trial 71 finished with value: 125.99202760941436 and parameters: {'n_estimators': 725, 'learning_rate': 0.22221988649994678, 'max_depth': 8, 'reg_lambda': 0.5543939062984486, 'subsample': 0.7155663207614813, 'colsample_bytree': 0.7879948765518512, 'gamma': 0.28693516165868166, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:37:59,702] Trial 72 finished with value: 125.15466708320214 and parameters: {'n_estimators': 658, 'learning_rate': 0.20078630473606993, 'max_depth': 7, 'reg_lambda': 0.32851245891714786, 'subsample': 0.7454767835897738, 'colsample_bytree': 0.7718127016296609, 'gamma': 0.3320238057130793, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:00,251] Trial 73 finished with value: 127.05605148968829 and parameters: {'n_estimators': 774, 'learning_rate': 0.19418204085947202, 'max_depth': 8, 'reg_lambda': 0.3996205352218638, 'subsample': 0.6624957021108089, 'colsample_bytree': 0.7415488926723343, 'gamma': 0.38956918003500157, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:00,845] Trial 74 finished with value: 127.00477580880037 and parameters: {'n_estimators': 698, 'learning_rate': 0.1568657594683309, 'max_depth': 8, 'reg_lambda': 0.060441513660355585, 'subsample': 0.7749597340694521, 'colsample_bytree': 0.8241096457583382, 'gamma': 0.3274662702144984, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:01,242] Trial 75 finished with value: 129.55909433052582 and parameters: {'n_estimators': 789, 'learning_rate': 0.2414438576927788, 'max_depth': 7, 'reg_lambda': 0.1953128707873193, 'subsample': 0.7526249917836912, 'colsample_bytree': 0.6977447054009538, 'gamma': 0.2205831466453665, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:01,925] Trial 76 finished with value: 125.8181429077456 and parameters: {'n_estimators': 661, 'learning_rate': 0.168233031570811, 'max_depth': 6, 'reg_lambda': 3.0516499799315895, 'subsample': 0.6895505116263969, 'colsample_bytree': 0.7553510761563896, 'gamma': 0.33647993674148724, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:02,537] Trial 77 finished with value: 126.47672362051323 and parameters: {'n_estimators': 745, 'learning_rate': 0.1463984477488365, 'max_depth': 9, 'reg_lambda': 0.03247273528508594, 'subsample': 0.6281318762788439, 'colsample_bytree': 0.6562427699236442, 'gamma': 0.26687154949887565, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:02,982] Trial 78 finished with value: 127.42978343435138 and parameters: {'n_estimators': 957, 'learning_rate': 0.27764832732633016, 'max_depth': 7, 'reg_lambda': 0.09349851795601326, 'subsample': 0.7982234742320704, 'colsample_bytree': 0.79842660876484, 'gamma': 0.3474105977813781, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:03,407] Trial 79 finished with value: 126.75795148181675 and parameters: {'n_estimators': 867, 'learning_rate': 0.19991996322767266, 'max_depth': 6, 'reg_lambda': 0.7856054483056987, 'subsample': 0.7115101567074824, 'colsample_bytree': 0.8392228276813398, 'gamma': 0.29680967590278323, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:03,738] Trial 80 finished with value: 127.3768448913728 and parameters: {'n_estimators': 523, 'learning_rate': 0.24538116887519784, 'max_depth': 8, 'reg_lambda': 0.011360681555715329, 'subsample': 0.603146724150795, 'colsample_bytree': 0.7282256062059301, 'gamma': 0.39194523797990655, 'min_child_weight': 4}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:04,196] Trial 81 finished with value: 125.74013106108312 and parameters: {'n_estimators': 700, 'learning_rate': 0.22622359532609446, 'max_depth': 7, 'reg_lambda': 1.6583776833047204, 'subsample': 0.7408298303973654, 'colsample_bytree': 0.7699584268356066, 'gamma': 0.293069437280325, 'min_child_weight': 5}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:04,734] Trial 82 finished with value: 126.3370628837374 and parameters: {'n_estimators': 598, 'learning_rate': 0.1815708451834071, 'max_depth': 8, 'reg_lambda': 1.0442256083104078, 'subsample': 0.7246567811435929, 'colsample_bytree': 0.7917220874690766, 'gamma': 0.23972452960367874, 'min_child_weight': 7}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:05,488] Trial 83 finished with value: 124.51314669592254 and parameters: {'n_estimators': 647, 'learning_rate': 0.20848651602110835, 'max_depth': 7, 'reg_lambda': 5.864365152232079, 'subsample': 0.6364316782214184, 'colsample_bytree': 0.7655542658021014, 'gamma': 0.2711321099798909, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:06,052] Trial 84 finished with value: 126.48625309941751 and parameters: {'n_estimators': 832, 'learning_rate': 0.1690307086874509, 'max_depth': 8, 'reg_lambda': 5.93475918201282, 'subsample': 0.573688043254779, 'colsample_bytree': 0.7478568850858001, 'gamma': 0.20126602669309385, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:06,529] Trial 85 finished with value: 127.28465074976386 and parameters: {'n_estimators': 721, 'learning_rate': 0.20888459703364087, 'max_depth': 7, 'reg_lambda': 14.014744516398116, 'subsample': 0.6393126693913412, 'colsample_bytree': 0.6944805604865388, 'gamma': 0.1703468601681579, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:06,979] Trial 86 finished with value: 126.42284664278968 and parameters: {'n_estimators': 673, 'learning_rate': 0.14306035021762123, 'max_depth': 7, 'reg_lambda': 3.9421380510283615, 'subsample': 0.6503029203741412, 'colsample_bytree': 0.6735333164076037, 'gamma': 0.25989221355957337, 'min_child_weight': 1}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:07,404] Trial 87 finished with value: 127.29731654400189 and parameters: {'n_estimators': 798, 'learning_rate': 0.26844306954518526, 'max_depth': 9, 'reg_lambda': 0.2633658232348248, 'subsample': 0.6805990259879263, 'colsample_bytree': 0.8074361001735342, 'gamma': 0.20703546007308973, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:08,137] Trial 88 finished with value: 128.46280575999685 and parameters: {'n_estimators': 927, 'learning_rate': 0.11415885383382905, 'max_depth': 6, 'reg_lambda': 21.525473510806762, 'subsample': 0.7072340108487369, 'colsample_bytree': 0.5829793503225817, 'gamma': 0.2361534423486446, 'min_child_weight': 3}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:08,739] Trial 89 finished with value: 127.28602703872797 and parameters: {'n_estimators': 759, 'learning_rate': 0.1299849657751349, 'max_depth': 8, 'reg_lambda': 3.738917271436467e-07, 'subsample': 0.5945044574324718, 'colsample_bytree': 0.849902122920978, 'gamma': 0.3523559007838637, 'min_child_weight': 2}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:09,291] Trial 90 finished with value: 125.4523784339578 and parameters: {'n_estimators': 658, 'learning_rate': 0.20800052943229816, 'max_depth': 6, 'reg_lambda': 0.45507707057474645, 'subsample': 0.6586164581540709, 'colsample_bytree': 0.7318421729668814, 'gamma': 0.17013744730350414, 'min_child_weight': 1}. Best is trial 50 with value: 124.37741557777078.
[I 2023-08-24 21:38:09,721] Trial 91 finished with value: 123.8956308052582 and parameters: {'n_estimators': 637, 'learning_rate': 0.22583163079394733, 'max_depth': 7, 'reg_lambda': 1.8340878558028564, 'subsample': 0.6267325025967294, 'colsample_bytree': 0.7659296394864177, 'gamma': 0.30762619427723126, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:10,185] Trial 92 finished with value: 127.85876987562972 and parameters: {'n_estimators': 565, 'learning_rate': 0.1873680345617087, 'max_depth': 7, 'reg_lambda': 8.536272933680042, 'subsample': 0.5660131077025174, 'colsample_bytree': 0.7646412761448776, 'gamma': 0.31756755241730106, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:10,644] Trial 93 finished with value: 128.0737366183879 and parameters: {'n_estimators': 737, 'learning_rate': 0.24214788271188034, 'max_depth': 7, 'reg_lambda': 2.764087111070568, 'subsample': 0.6286314070387166, 'colsample_bytree': 0.718368594103462, 'gamma': 0.2678402131961928, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:11,175] Trial 94 finished with value: 124.6392831490082 and parameters: {'n_estimators': 709, 'learning_rate': 0.16573328646523308, 'max_depth': 8, 'reg_lambda': 1.7610801964780194, 'subsample': 0.6814160855107918, 'colsample_bytree': 0.7847244683988754, 'gamma': 0.37243774134266094, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:11,794] Trial 95 finished with value: 126.58350101346032 and parameters: {'n_estimators': 624, 'learning_rate': 0.16293679848061535, 'max_depth': 7, 'reg_lambda': 6.630934995851695, 'subsample': 0.5439611093002767, 'colsample_bytree': 0.756370251624473, 'gamma': 0.430318001886578, 'min_child_weight': 5}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:12,377] Trial 96 finished with value: 126.92371620159005 and parameters: {'n_estimators': 693, 'learning_rate': 0.14674176872561442, 'max_depth': 9, 'reg_lambda': 1.5018518693070424, 'subsample': 0.7044468371861795, 'colsample_bytree': 0.8193694128251744, 'gamma': 0.36822843471770816, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:12,873] Trial 97 finished with value: 126.2421739707966 and parameters: {'n_estimators': 651, 'learning_rate': 0.19165136077417266, 'max_depth': 8, 'reg_lambda': 4.146283402475353, 'subsample': 0.6022489145452287, 'colsample_bytree': 0.6887283555095808, 'gamma': 0.34170181665465577, 'min_child_weight': 5}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:13,640] Trial 98 finished with value: 126.79128251928526 and parameters: {'n_estimators': 720, 'learning_rate': 0.17484553918460102, 'max_depth': 7, 'reg_lambda': 13.335601803812068, 'subsample': 0.6216060636846312, 'colsample_bytree': 0.6389219415297601, 'gamma': 0.22594012123589255, 'min_child_weight': 4}. Best is trial 91 with value: 123.8956308052582.
[I 2023-08-24 21:38:14,077] Trial 99 finished with value: 127.43140816868703 and parameters: {'n_estimators': 590, 'learning_rate': 0.13430577539630337, 'max_depth': 6, 'reg_lambda': 0.7523074930100915, 'subsample': 0.6537035477441075, 'colsample_bytree': 0.7080497097831233, 'gamma': 0.20074564472503695, 'min_child_weight': 1}. Best is trial 91 with value: 123.8956308052582.
Best Hyperparameters: {'n_estimators': 637, 'learning_rate': 0.22583163079394733, 'max_depth': 7, 'reg_lambda': 1.8340878558028564, 'subsample': 0.6267325025967294, 'colsample_bytree': 0.7659296394864177, 'gamma': 0.30762619427723126, 'min_child_weight': 4}
Best Score (MAE): 123.8956308052582

💡 Findings

The best parameters appear to be:


{'n_estimators': 495, 'learning_rate': 0.133354970082411, 'max_depth': 7, 'reg_lambda': 4.53385562551496e-08, 'subsample': 0.8141582328181602, 'colsample_bytree': 0.7732232759096854, 'gamma': 0.34364076800046645, 'min_child_weight': 1}


Feature Engineer Total Model

¶

🧠 A good idea

Let's get a baseline for how well our model is performing.

In [ ]:
X = training.drop(columns= targetColumns  + ['TruckSID'])
y = target_ActualWeightTotal
X_transformed = data_prep_pipeline.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=1)
model = XGBRegressor()
scores = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")
print(scores)
print("MAE score: {}".format(np.mean(scores)))
[306.28804635 280.11694204 300.64648174 289.98769003 280.72701119]
MAE score: 291.5532342694257
In [ ]:
X_train_fe = FE(X_train)
X_test_fe = FE(X_test)

model = XGBRegressor(random_state=42) 
model.fit(X_train_fe, y_train)
scores = -cross_val_score(model, X_train_fe, y_train, cv=5, scoring="neg_mean_absolute_error")

print(scores)
print("Mean:", scores.mean())
xgb.plot_importance(model)
plt.show()
[307.162339   278.79971759 297.96208562 289.35171822 288.43561286]
Mean: 292.3422946579392
In [ ]:
feature_importances = model.feature_importances_

# Create a DataFrame for better manipulation
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})
importance_df_sorted =  importance_df.sort_values(by='Importance', ascending=False)
importance_df_sorted.to_csv('importance_df_sorted.csv', index=False)
importance_df_sorted
Out[ ]:
Feature Importance
17 EngineFamily 0.610977
18 TransmissionFamily 0.193234
1 Transmission 0.058836
12 RearWheels 0.033532
16 TagAxle 0.018143
13 RearTires 0.011649
27 sum_WheelBase_per_Engine 0.007636
4 Overhang 0.007139
6 Liner 0.006805
21 WheelBase_squared 0.006624
24 WheelBase_to_Overhang 0.006261
0 Engine 0.006041
3 WheelBase 0.003928
11 FrontSusp 0.003525
28 max_Overhang_per_Transmission 0.003510
15 FrontTires 0.003481
22 Overhang_squared 0.002605
23 Front_to_Rear_Wheels 0.002384
9 RearAxels 0.002239
8 Cab 0.001929
14 FrontWheels 0.001741
19 Engine_Transmission 0.001671
10 RearSusp 0.001582
26 avg_Overhang_per_EngineFamily 0.001364
2 FrontAxlePosition 0.001027
5 FrameRails 0.001025
7 FrontEndExt 0.000698
20 TransmissionFamily_EngineFamily 0.000417
25 avg_WheelBase_per_TransmissionFamily 0.000000
In [ ]:
model = xgb.train({"learning_rate": 0.1}, xgb.DMatrix(X_train_fe, label=y_train), 100)

# Specifically for XGBoost, you can use the TreeExplainer which works best for tree-based models
explainer = shap.TreeExplainer(model)

# Compute SHAP values for a sample of the test set
shap_values = explainer.shap_values(X_test_fe)

# Visualize the first prediction's explanation
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], X_test_fe.iloc[0])

shap.summary_plot(shap_values, X_test_fe)
# For a dependence plot for a specific feature 
for i in range(5):
    feature_name = X_test.columns[i]  # get the name of the first feature
    shap.dependence_plot(feature_name, shap_values, X_test_fe)
In [ ]:
selected_features = importance_df[importance_df['Importance'] >= 0.0001]['Feature'].tolist()

# Subset the dataset
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
In [ ]:
model = XGBRegressor(random_state=42)
model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
scores = -cross_val_score(model, X_train_selected, y_train, cv=5, scoring="neg_mean_absolute_error")

print(scores)
print("Mean:", scores.mean())
[307.162339   278.79971759 297.96208562 289.35171822 288.43561286]
Mean: 292.3422946579392
In [ ]:
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_selected)
X_val_pca = pca.transform(X_test_selected)
In [ ]:
model = XGBRegressor(random_state=42)  # or xgb.XGBClassifier() for classification
model.fit(X_train_pca, y_train)
y_pred = model.predict(X_val_pca)
scores = -cross_val_score(model, X_train_selected, y_train, cv=5, scoring="neg_mean_absolute_error")

print(scores)
print("Mean:", scores.mean())
[307.162339   278.79971759 297.96208562 289.35171822 288.43561286]
Mean: 292.3422946579392

💡 Findings

I was unsuccessful in creating a useful combination of features, adding extra dimensions didn't help the model and reducing existing dimensions didn't help either.

In [ ]:
X_train_selected.to_csv('X_train_selected_total.csv', index=False)
In [ ]:
X = X_transformed
y = target_ActualWeightTotal

def objective(trial):
    # Define the hyperparameter search space
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 1000),
        "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 1, 10),
        "reg_lambda": trial.suggest_loguniform("reg_lambda", 1e-9, 100),
        "subsample": trial.suggest_float("subsample", 0.1, 1),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.1, 1),
        "gamma": trial.suggest_float("gamma", 0, 1),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
    }

    # Split the data into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the XGBRegressor with the suggested hyperparameters
    model = XGBRegressor(**params, eval_metric='mae',n_jobs=1,random_state=42)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=50, verbose=False)

    # Calculate the MAE on the validation set
    y_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)

    return mae

# Create a study object and specify that the goal is to minimize the objective function
study = optuna.create_study(direction="minimize")  # We want to minimize the MAE
study.optimize(objective, n_trials=100)

# Get the best hyperparameters and their corresponding MAE
best_params = study.best_params
best_score = study.best_value

print("Best Hyperparameters:", best_params)
print("Best Score (MAE):", best_score)
[I 2023-08-24 21:38:24,047] A new study created in memory with name: no-name-ca58702b-b89b-4fce-89e2-c279cd59fb1b
[I 2023-08-24 21:38:24,223] Trial 0 finished with value: 503.87475155462846 and parameters: {'n_estimators': 183, 'learning_rate': 0.03460310884107015, 'max_depth': 1, 'reg_lambda': 8.459065667850058e-07, 'subsample': 0.52198747648067, 'colsample_bytree': 0.1323594584088897, 'gamma': 0.7020439976815452, 'min_child_weight': 8}. Best is trial 0 with value: 503.87475155462846.
[I 2023-08-24 21:38:24,538] Trial 1 finished with value: 307.60744770347924 and parameters: {'n_estimators': 266, 'learning_rate': 0.08582965349014518, 'max_depth': 4, 'reg_lambda': 1.5603737076552077e-06, 'subsample': 0.2702208246996484, 'colsample_bytree': 0.8367524461810938, 'gamma': 0.8732546095813712, 'min_child_weight': 4}. Best is trial 1 with value: 307.60744770347924.
[I 2023-08-24 21:38:25,186] Trial 2 finished with value: 343.12041482997483 and parameters: {'n_estimators': 457, 'learning_rate': 0.01669704051905962, 'max_depth': 3, 'reg_lambda': 0.007347039573603694, 'subsample': 0.2677775949077245, 'colsample_bytree': 0.8183081675138126, 'gamma': 0.6999865133216777, 'min_child_weight': 7}. Best is trial 1 with value: 307.60744770347924.
[I 2023-08-24 21:38:26,884] Trial 3 finished with value: 320.6799666935611 and parameters: {'n_estimators': 926, 'learning_rate': 0.011794523449264661, 'max_depth': 3, 'reg_lambda': 0.000245475032093513, 'subsample': 0.52867131847601, 'colsample_bytree': 0.970469194695121, 'gamma': 0.36875815825087366, 'min_child_weight': 6}. Best is trial 1 with value: 307.60744770347924.
[I 2023-08-24 21:38:27,700] Trial 4 finished with value: 328.6597552936083 and parameters: {'n_estimators': 657, 'learning_rate': 0.015540552013706887, 'max_depth': 3, 'reg_lambda': 1.5537288821449102e-08, 'subsample': 0.8653645697399861, 'colsample_bytree': 0.6009729167657036, 'gamma': 0.3507613787063616, 'min_child_weight': 1}. Best is trial 1 with value: 307.60744770347924.
[I 2023-08-24 21:38:28,210] Trial 5 finished with value: 291.5666854435611 and parameters: {'n_estimators': 407, 'learning_rate': 0.08658553956738145, 'max_depth': 4, 'reg_lambda': 4.7345639085361925e-07, 'subsample': 0.9529558015145848, 'colsample_bytree': 0.3024623277710474, 'gamma': 0.9482554848421325, 'min_child_weight': 4}. Best is trial 5 with value: 291.5666854435611.
[I 2023-08-24 21:38:28,708] Trial 6 finished with value: 346.3486291227172 and parameters: {'n_estimators': 408, 'learning_rate': 0.050360688497556205, 'max_depth': 6, 'reg_lambda': 30.161356874006668, 'subsample': 0.8617818473208314, 'colsample_bytree': 0.14869488509115344, 'gamma': 0.5853643113687939, 'min_child_weight': 9}. Best is trial 5 with value: 291.5666854435611.
[I 2023-08-24 21:38:29,158] Trial 7 finished with value: 469.22744189822106 and parameters: {'n_estimators': 360, 'learning_rate': 0.01205245158400343, 'max_depth': 3, 'reg_lambda': 2.245099789645699e-07, 'subsample': 0.9502770181408795, 'colsample_bytree': 0.23026170094468384, 'gamma': 0.9991234103420932, 'min_child_weight': 6}. Best is trial 5 with value: 291.5666854435611.
[I 2023-08-24 21:38:29,237] Trial 8 finished with value: 3861.7185298429627 and parameters: {'n_estimators': 36, 'learning_rate': 0.04185734554297638, 'max_depth': 5, 'reg_lambda': 1.8922810517443854e-07, 'subsample': 0.7328108166709979, 'colsample_bytree': 0.3805326370500097, 'gamma': 0.8220061575608637, 'min_child_weight': 10}. Best is trial 5 with value: 291.5666854435611.
[I 2023-08-24 21:38:30,167] Trial 9 finished with value: 297.8410503089578 and parameters: {'n_estimators': 578, 'learning_rate': 0.03274400221791503, 'max_depth': 4, 'reg_lambda': 7.308456985962609e-07, 'subsample': 0.38790819667753573, 'colsample_bytree': 0.44358230100438323, 'gamma': 0.7147952401286152, 'min_child_weight': 9}. Best is trial 5 with value: 291.5666854435611.
[I 2023-08-24 21:38:30,690] Trial 10 finished with value: 287.42760892238664 and parameters: {'n_estimators': 718, 'learning_rate': 0.22933266083251272, 'max_depth': 9, 'reg_lambda': 1.6633234757988981e-09, 'subsample': 0.7049368159235926, 'colsample_bytree': 0.31842864185601794, 'gamma': 0.04826145816750316, 'min_child_weight': 3}. Best is trial 10 with value: 287.42760892238664.
[I 2023-08-24 21:38:31,141] Trial 11 finished with value: 285.3675060512437 and parameters: {'n_estimators': 772, 'learning_rate': 0.2600216176334391, 'max_depth': 10, 'reg_lambda': 1.1270623281253282e-09, 'subsample': 0.7310007031926059, 'colsample_bytree': 0.29059012407380425, 'gamma': 0.011619588168750201, 'min_child_weight': 3}. Best is trial 11 with value: 285.3675060512437.
[I 2023-08-24 21:38:31,659] Trial 12 finished with value: 281.53009263814545 and parameters: {'n_estimators': 802, 'learning_rate': 0.2868624965544731, 'max_depth': 10, 'reg_lambda': 1.2858055748963844e-09, 'subsample': 0.6929442034265967, 'colsample_bytree': 0.4779897434807927, 'gamma': 0.07944427798182607, 'min_child_weight': 2}. Best is trial 12 with value: 281.53009263814545.
[I 2023-08-24 21:38:32,233] Trial 13 finished with value: 283.4363463279282 and parameters: {'n_estimators': 886, 'learning_rate': 0.27348498062253, 'max_depth': 10, 'reg_lambda': 1.1067329544854548e-09, 'subsample': 0.6660951503139101, 'colsample_bytree': 0.5241730085569677, 'gamma': 0.009341115633953123, 'min_child_weight': 1}. Best is trial 12 with value: 281.53009263814545.
[I 2023-08-24 21:38:32,530] Trial 14 finished with value: 276.23720752322106 and parameters: {'n_estimators': 998, 'learning_rate': 0.29915511444217346, 'max_depth': 8, 'reg_lambda': 1.0411830447870452e-09, 'subsample': 0.6306025795570318, 'colsample_bytree': 0.5362570698131197, 'gamma': 0.11336897948068983, 'min_child_weight': 1}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:32,779] Trial 15 finished with value: 311.20728461508185 and parameters: {'n_estimators': 983, 'learning_rate': 0.18033823697611487, 'max_depth': 8, 'reg_lambda': 1.6906308168756473e-08, 'subsample': 0.11913104692330123, 'colsample_bytree': 0.59110148979193, 'gamma': 0.16026130463117214, 'min_child_weight': 2}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:33,305] Trial 16 finished with value: 278.1858149992128 and parameters: {'n_estimators': 825, 'learning_rate': 0.1440033863673198, 'max_depth': 7, 'reg_lambda': 2.3297585797411725e-05, 'subsample': 0.5886233633918191, 'colsample_bytree': 0.48539098141685694, 'gamma': 0.17946885570146187, 'min_child_weight': 1}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:33,873] Trial 17 finished with value: 281.16425436870276 and parameters: {'n_estimators': 998, 'learning_rate': 0.1551142492739507, 'max_depth': 7, 'reg_lambda': 3.9446199422950445e-05, 'subsample': 0.586356888319233, 'colsample_bytree': 0.6430166238301531, 'gamma': 0.22251866811655746, 'min_child_weight': 4}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:34,333] Trial 18 finished with value: 285.17326850795024 and parameters: {'n_estimators': 851, 'learning_rate': 0.1420885585516838, 'max_depth': 7, 'reg_lambda': 2.6325353995211063e-05, 'subsample': 0.45727001646082466, 'colsample_bytree': 0.4227026399606507, 'gamma': 0.1776967524919919, 'min_child_weight': 1}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:34,879] Trial 19 finished with value: 278.1543854297859 and parameters: {'n_estimators': 587, 'learning_rate': 0.10801186149526924, 'max_depth': 8, 'reg_lambda': 0.005418123253574687, 'subsample': 0.6040591446928526, 'colsample_bytree': 0.6754565660864857, 'gamma': 0.2991834812471706, 'min_child_weight': 2}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:35,314] Trial 20 finished with value: 279.3540641727015 and parameters: {'n_estimators': 586, 'learning_rate': 0.10741911185263471, 'max_depth': 8, 'reg_lambda': 0.0032654706491795143, 'subsample': 0.6287943831753671, 'colsample_bytree': 0.6651570664150814, 'gamma': 0.3254494245690571, 'min_child_weight': 5}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:35,704] Trial 21 finished with value: 279.09550510665935 and parameters: {'n_estimators': 688, 'learning_rate': 0.180524699839149, 'max_depth': 8, 'reg_lambda': 0.025498723376927247, 'subsample': 0.5962268645993603, 'colsample_bytree': 0.5193985174899024, 'gamma': 0.26935666589792595, 'min_child_weight': 2}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:36,127] Trial 22 finished with value: 279.2226587098552 and parameters: {'n_estimators': 546, 'learning_rate': 0.11594188615771885, 'max_depth': 7, 'reg_lambda': 0.06382145996196928, 'subsample': 0.4889828189974613, 'colsample_bytree': 0.6893361108761109, 'gamma': 0.13851497049769423, 'min_child_weight': 1}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:36,461] Trial 23 finished with value: 282.2157748051795 and parameters: {'n_estimators': 803, 'learning_rate': 0.20950291448791833, 'max_depth': 6, 'reg_lambda': 9.010695437000935e-06, 'subsample': 0.5916354416847946, 'colsample_bytree': 0.5362615138704698, 'gamma': 0.2642117289027355, 'min_child_weight': 3}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:37,069] Trial 24 finished with value: 280.09062106423175 and parameters: {'n_estimators': 935, 'learning_rate': 0.141507559008247, 'max_depth': 9, 'reg_lambda': 0.0004559781334346122, 'subsample': 0.7874516798569382, 'colsample_bytree': 0.7302482812736135, 'gamma': 0.4402891737261523, 'min_child_weight': 2}. Best is trial 14 with value: 276.23720752322106.
[I 2023-08-24 21:38:37,528] Trial 25 finished with value: 275.5025570194427 and parameters: {'n_estimators': 623, 'learning_rate': 0.19561243660747582, 'max_depth': 9, 'reg_lambda': 0.17273368678437587, 'subsample': 0.6377773512823068, 'colsample_bytree': 0.5801202643323335, 'gamma': 0.12233733494713053, 'min_child_weight': 1}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:37,980] Trial 26 finished with value: 277.9574457355951 and parameters: {'n_estimators': 614, 'learning_rate': 0.217000386191815, 'max_depth': 9, 'reg_lambda': 0.268216139801137, 'subsample': 0.6552491880379386, 'colsample_bytree': 0.5427747713022606, 'gamma': 0.09094291190515338, 'min_child_weight': 3}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:38,460] Trial 27 finished with value: 281.32694918923175 and parameters: {'n_estimators': 645, 'learning_rate': 0.20896863980045055, 'max_depth': 9, 'reg_lambda': 0.357614454303984, 'subsample': 0.6559706797109693, 'colsample_bytree': 0.5674021290332665, 'gamma': 0.10079674419032192, 'min_child_weight': 3}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:38,925] Trial 28 finished with value: 281.68758609493074 and parameters: {'n_estimators': 494, 'learning_rate': 0.26667080219581796, 'max_depth': 9, 'reg_lambda': 0.3955741076271662, 'subsample': 0.8035261084722158, 'colsample_bytree': 0.4085984459335157, 'gamma': 0.09049864514427342, 'min_child_weight': 5}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:39,425] Trial 29 finished with value: 275.6900705486461 and parameters: {'n_estimators': 282, 'learning_rate': 0.20936031602434668, 'max_depth': 10, 'reg_lambda': 3.0367839991884775, 'subsample': 0.5322684529622277, 'colsample_bytree': 0.5876945827454486, 'gamma': 0.23307106313059275, 'min_child_weight': 2}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:39,601] Trial 30 finished with value: 401.44545640152705 and parameters: {'n_estimators': 177, 'learning_rate': 0.17883057170141184, 'max_depth': 1, 'reg_lambda': 4.643900813619784, 'subsample': 0.540534835334475, 'colsample_bytree': 0.6246675512139492, 'gamma': 0.21824306753568923, 'min_child_weight': 1}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:40,344] Trial 31 finished with value: 277.867999252204 and parameters: {'n_estimators': 239, 'learning_rate': 0.2899816404555501, 'max_depth': 10, 'reg_lambda': 96.30202645971627, 'subsample': 0.6506099735441224, 'colsample_bytree': 0.5510752754114139, 'gamma': 0.11159469814966103, 'min_child_weight': 2}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:40,693] Trial 32 finished with value: 284.09638450488035 and parameters: {'n_estimators': 261, 'learning_rate': 0.2850161025138182, 'max_depth': 10, 'reg_lambda': 13.225471472498723, 'subsample': 0.48507616764254324, 'colsample_bytree': 0.4786306016311545, 'gamma': 0.1478015903765593, 'min_child_weight': 2}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:41,181] Trial 33 finished with value: 284.02449769757555 and parameters: {'n_estimators': 154, 'learning_rate': 0.22148739899871925, 'max_depth': 10, 'reg_lambda': 75.89645312218785, 'subsample': 0.5460212213871909, 'colsample_bytree': 0.5929846413449087, 'gamma': 0.23003173012123695, 'min_child_weight': 2}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:41,542] Trial 34 finished with value: 276.2260803683879 and parameters: {'n_estimators': 297, 'learning_rate': 0.25076774650315714, 'max_depth': 9, 'reg_lambda': 2.65428437367735, 'subsample': 0.44007934933988513, 'colsample_bytree': 0.721766422438874, 'gamma': 0.0071927918546931535, 'min_child_weight': 4}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:41,990] Trial 35 finished with value: 282.4661253345403 and parameters: {'n_estimators': 350, 'learning_rate': 0.1767013516396914, 'max_depth': 9, 'reg_lambda': 3.1742554253117876, 'subsample': 0.4165778181819284, 'colsample_bytree': 0.7764575819410393, 'gamma': 0.020781583381783625, 'min_child_weight': 7}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:42,273] Trial 36 finished with value: 290.15312721386965 and parameters: {'n_estimators': 103, 'learning_rate': 0.07721483024568497, 'max_depth': 8, 'reg_lambda': 1.3370912181105128, 'subsample': 0.3737258696517997, 'colsample_bytree': 0.717477720803216, 'gamma': 0.05810721329475632, 'min_child_weight': 4}. Best is trial 25 with value: 275.5025570194427.
[I 2023-08-24 21:38:42,647] Trial 37 finished with value: 272.37475032470087 and parameters: {'n_estimators': 305, 'learning_rate': 0.2381898419257861, 'max_depth': 9, 'reg_lambda': 1.9065146239404211, 'subsample': 0.5221174207401648, 'colsample_bytree': 0.8762984254488593, 'gamma': 0.004010513787127894, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:43,136] Trial 38 finished with value: 280.12048001613664 and parameters: {'n_estimators': 307, 'learning_rate': 0.21337407547227186, 'max_depth': 9, 'reg_lambda': 11.547755675294265, 'subsample': 0.5183498856410114, 'colsample_bytree': 0.8963812516815987, 'gamma': 0.0451076267933046, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:43,433] Trial 39 finished with value: 371.4022896331864 and parameters: {'n_estimators': 443, 'learning_rate': 0.24336880351803475, 'max_depth': 1, 'reg_lambda': 1.6507534271268642, 'subsample': 0.4386022675478367, 'colsample_bytree': 0.8592784742251566, 'gamma': 0.011470087351944016, 'min_child_weight': 5}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:43,858] Trial 40 finished with value: 280.6790762259918 and parameters: {'n_estimators': 220, 'learning_rate': 0.16100496028900804, 'max_depth': 5, 'reg_lambda': 24.550129422102575, 'subsample': 0.33417657433368564, 'colsample_bytree': 0.7689380635912174, 'gamma': 0.40209068686922367, 'min_child_weight': 6}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:44,252] Trial 41 finished with value: 277.0701698284005 and parameters: {'n_estimators': 338, 'learning_rate': 0.25217257527915415, 'max_depth': 8, 'reg_lambda': 0.9693110898244365, 'subsample': 0.5153604543915742, 'colsample_bytree': 0.9952695289511038, 'gamma': 0.12781663359684567, 'min_child_weight': 1}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:44,590] Trial 42 finished with value: 278.73267155029913 and parameters: {'n_estimators': 109, 'learning_rate': 0.20078413896214647, 'max_depth': 9, 'reg_lambda': 0.10674894137170153, 'subsample': 0.5508664522137815, 'colsample_bytree': 0.594777992016356, 'gamma': 0.06605277698912926, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:45,015] Trial 43 finished with value: 282.07098158060455 and parameters: {'n_estimators': 400, 'learning_rate': 0.24657782447673263, 'max_depth': 10, 'reg_lambda': 6.103356665635787, 'subsample': 0.4809487764376245, 'colsample_bytree': 0.621151949442091, 'gamma': 0.0018296475595689227, 'min_child_weight': 7}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:45,374] Trial 44 finished with value: 275.9902749626102 and parameters: {'n_estimators': 293, 'learning_rate': 0.29054059816715466, 'max_depth': 9, 'reg_lambda': 2.4690770909365156, 'subsample': 0.5608948576972114, 'colsample_bytree': 0.8219089347358931, 'gamma': 0.18527725422289396, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:45,911] Trial 45 finished with value: 278.2358324641845 and parameters: {'n_estimators': 503, 'learning_rate': 0.18885857508387224, 'max_depth': 10, 'reg_lambda': 2.885304058807542, 'subsample': 0.44657598106617113, 'colsample_bytree': 0.9350757762228091, 'gamma': 0.19604221125315546, 'min_child_weight': 5}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:46,262] Trial 46 finished with value: 273.3197491931675 and parameters: {'n_estimators': 303, 'learning_rate': 0.2381434832850326, 'max_depth': 9, 'reg_lambda': 0.9170187868915959, 'subsample': 0.5665630707800245, 'colsample_bytree': 0.8207941113034839, 'gamma': 0.1644446202191974, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:46,805] Trial 47 finished with value: 275.7896135075567 and parameters: {'n_estimators': 448, 'learning_rate': 0.12639956482211503, 'max_depth': 10, 'reg_lambda': 0.6496140036210473, 'subsample': 0.5520082526060971, 'colsample_bytree': 0.8485012737783704, 'gamma': 0.26647701245627253, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:47,370] Trial 48 finished with value: 280.731460071631 and parameters: {'n_estimators': 437, 'learning_rate': 0.13011883130369223, 'max_depth': 10, 'reg_lambda': 0.6341484468576574, 'subsample': 0.5163982502138322, 'colsample_bytree': 0.8669448200804477, 'gamma': 0.35247544961367067, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:47,794] Trial 49 finished with value: 312.37157096190174 and parameters: {'n_estimators': 390, 'learning_rate': 0.17056927086730603, 'max_depth': 2, 'reg_lambda': 0.09748189311358584, 'subsample': 0.6967247520079654, 'colsample_bytree': 0.9355849762069983, 'gamma': 0.29175686180344945, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:48,374] Trial 50 finished with value: 279.9216376239767 and parameters: {'n_estimators': 477, 'learning_rate': 0.15908176645190095, 'max_depth': 10, 'reg_lambda': 10.968270358718243, 'subsample': 0.563529129991677, 'colsample_bytree': 0.7800248824971729, 'gamma': 0.16169786751097523, 'min_child_weight': 6}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:49,090] Trial 51 finished with value: 276.920177699937 and parameters: {'n_estimators': 300, 'learning_rate': 0.19396407129958174, 'max_depth': 9, 'reg_lambda': 1.1947078614301847, 'subsample': 0.6153196415320673, 'colsample_bytree': 0.8205224942916335, 'gamma': 0.2369218093813788, 'min_child_weight': 5}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:49,513] Trial 52 finished with value: 278.9103173705132 and parameters: {'n_estimators': 534, 'learning_rate': 0.23405464391550176, 'max_depth': 9, 'reg_lambda': 0.2587732282051522, 'subsample': 0.5867201264325743, 'colsample_bytree': 0.8170601578030543, 'gamma': 0.2051491288034059, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:50,006] Trial 53 finished with value: 280.68965360319584 and parameters: {'n_estimators': 221, 'learning_rate': 0.29977928938080906, 'max_depth': 10, 'reg_lambda': 0.7368019200474047, 'subsample': 0.5605163883806396, 'colsample_bytree': 0.914299865442843, 'gamma': 0.1729820449064228, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:50,479] Trial 54 finished with value: 280.515519226228 and parameters: {'n_estimators': 370, 'learning_rate': 0.2270103421041519, 'max_depth': 8, 'reg_lambda': 6.735108579951337, 'subsample': 0.5192029089464596, 'colsample_bytree': 0.8599881758342562, 'gamma': 0.25368036912755343, 'min_child_weight': 6}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:51,096] Trial 55 finished with value: 280.62058947969143 and parameters: {'n_estimators': 328, 'learning_rate': 0.2005036917735287, 'max_depth': 9, 'reg_lambda': 37.987044795428524, 'subsample': 0.6182122298207031, 'colsample_bytree': 0.9717514784598534, 'gamma': 0.3131643481474765, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:51,527] Trial 56 finished with value: 280.691068019915 and parameters: {'n_estimators': 274, 'learning_rate': 0.26578042327756657, 'max_depth': 7, 'reg_lambda': 0.025467347544414282, 'subsample': 0.5705724575600597, 'colsample_bytree': 0.8094067515837664, 'gamma': 0.13535312221925078, 'min_child_weight': 5}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:51,980] Trial 57 finished with value: 276.5312536897828 and parameters: {'n_estimators': 434, 'learning_rate': 0.16033977873245345, 'max_depth': 6, 'reg_lambda': 0.16580531822920738, 'subsample': 0.49462009150775604, 'colsample_bytree': 0.875006180201521, 'gamma': 0.19996875764557964, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:52,765] Trial 58 finished with value: 277.600112415381 and parameters: {'n_estimators': 723, 'learning_rate': 0.12886378456576864, 'max_depth': 8, 'reg_lambda': 0.5113504925817178, 'subsample': 0.6780280202771731, 'colsample_bytree': 0.830722786112942, 'gamma': 0.2714369694016817, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:53,337] Trial 59 finished with value: 276.7184363684666 and parameters: {'n_estimators': 548, 'learning_rate': 0.1939476193602629, 'max_depth': 10, 'reg_lambda': 1.8322501897740024, 'subsample': 0.6316176792745133, 'colsample_bytree': 0.7663258801674331, 'gamma': 0.057917606483270684, 'min_child_weight': 7}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:53,431] Trial 60 finished with value: 344.3452628109257 and parameters: {'n_estimators': 34, 'learning_rate': 0.2396604851252634, 'max_depth': 4, 'reg_lambda': 19.120124180084584, 'subsample': 0.7245851051448065, 'colsample_bytree': 0.6586506874465721, 'gamma': 0.11174226855939313, 'min_child_weight': 10}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:53,806] Trial 61 finished with value: 276.30949159713475 and parameters: {'n_estimators': 311, 'learning_rate': 0.2604652597224021, 'max_depth': 9, 'reg_lambda': 2.6265534445482395, 'subsample': 0.4646861384344494, 'colsample_bytree': 0.6995432817694202, 'gamma': 0.0376418404015602, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:54,324] Trial 62 finished with value: 278.39791109099497 and parameters: {'n_estimators': 257, 'learning_rate': 0.22775122304063786, 'max_depth': 9, 'reg_lambda': 4.5376733552435615, 'subsample': 0.5337099257377079, 'colsample_bytree': 0.7346200213351805, 'gamma': 0.07798271611081192, 'min_child_weight': 4}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:54,778] Trial 63 finished with value: 278.53257955171597 and parameters: {'n_estimators': 204, 'learning_rate': 0.2998830687192381, 'max_depth': 10, 'reg_lambda': 0.8547176909335645, 'subsample': 0.5705780825444046, 'colsample_bytree': 0.8432622020953461, 'gamma': 0.15308439016523745, 'min_child_weight': 5}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:55,155] Trial 64 finished with value: 280.69599510980794 and parameters: {'n_estimators': 136, 'learning_rate': 0.2696551523598431, 'max_depth': 8, 'reg_lambda': 0.05559420946866001, 'subsample': 0.4980412278037374, 'colsample_bytree': 0.8028210058065437, 'gamma': 0.08447852354920836, 'min_child_weight': 9}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:55,620] Trial 65 finished with value: 278.3085900602172 and parameters: {'n_estimators': 287, 'learning_rate': 0.17264878031439726, 'max_depth': 9, 'reg_lambda': 0.25195206569271933, 'subsample': 0.6036516408069978, 'colsample_bytree': 0.648795891292979, 'gamma': 0.035072697553006946, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:56,066] Trial 66 finished with value: 275.2802119411209 and parameters: {'n_estimators': 370, 'learning_rate': 0.24421624392569272, 'max_depth': 10, 'reg_lambda': 6.476931872900835, 'subsample': 0.46548126943654394, 'colsample_bytree': 0.748356747847216, 'gamma': 0.1768609043088442, 'min_child_weight': 2}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:56,687] Trial 67 finished with value: 276.49561530817067 and parameters: {'n_estimators': 365, 'learning_rate': 0.20975576255138204, 'max_depth': 10, 'reg_lambda': 8.09185023167126, 'subsample': 0.5330742753603226, 'colsample_bytree': 0.7401663293480729, 'gamma': 0.19267392689791513, 'min_child_weight': 2}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:57,493] Trial 68 finished with value: 279.0145881710485 and parameters: {'n_estimators': 418, 'learning_rate': 0.14753215782247736, 'max_depth': 10, 'reg_lambda': 36.574045919385604, 'subsample': 0.5793408093035645, 'colsample_bytree': 0.8967608390724144, 'gamma': 0.2298237081623546, 'min_child_weight': 1}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:57,939] Trial 69 finished with value: 277.97393168490237 and parameters: {'n_estimators': 467, 'learning_rate': 0.1859597984694104, 'max_depth': 10, 'reg_lambda': 0.5688452516316728, 'subsample': 0.47386602687237295, 'colsample_bytree': 0.8411797130159789, 'gamma': 0.12957135233406117, 'min_child_weight': 2}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:58,449] Trial 70 finished with value: 278.87633078164356 and parameters: {'n_estimators': 382, 'learning_rate': 0.22446331050966178, 'max_depth': 9, 'reg_lambda': 4.999293896368495, 'subsample': 0.6244906487431542, 'colsample_bytree': 0.6724936876150527, 'gamma': 0.17943828745426213, 'min_child_weight': 8}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:58,804] Trial 71 finished with value: 273.32235171993074 and parameters: {'n_estimators': 331, 'learning_rate': 0.25292176115681114, 'max_depth': 9, 'reg_lambda': 2.040776371516244, 'subsample': 0.44933132543947985, 'colsample_bytree': 0.697632301572057, 'gamma': 0.11003329443230392, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:59,216] Trial 72 finished with value: 273.0269132753464 and parameters: {'n_estimators': 332, 'learning_rate': 0.2720516988038279, 'max_depth': 9, 'reg_lambda': 1.5673274956831946, 'subsample': 0.5004402696324488, 'colsample_bytree': 0.6886060600229759, 'gamma': 0.10616719767043592, 'min_child_weight': 3}. Best is trial 37 with value: 272.37475032470087.
[I 2023-08-24 21:38:59,615] Trial 73 finished with value: 271.38115701747483 and parameters: {'n_estimators': 341, 'learning_rate': 0.2570556611802518, 'max_depth': 10, 'reg_lambda': 1.2436839685308987, 'subsample': 0.46378166789894804, 'colsample_bytree': 0.6956806584041942, 'gamma': 0.12525719229146448, 'min_child_weight': 2}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:00,085] Trial 74 finished with value: 277.38662158572106 and parameters: {'n_estimators': 332, 'learning_rate': 0.2618987838958368, 'max_depth': 9, 'reg_lambda': 15.747295512252679, 'subsample': 0.4086118948092844, 'colsample_bytree': 0.6901931626640395, 'gamma': 0.10895990733481478, 'min_child_weight': 2}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:00,395] Trial 75 finished with value: 276.73891466270464 and parameters: {'n_estimators': 242, 'learning_rate': 0.2317299999763361, 'max_depth': 8, 'reg_lambda': 1.7083895351429395, 'subsample': 0.4606362096025495, 'colsample_bytree': 0.6454192030733011, 'gamma': 0.07762021516038076, 'min_child_weight': 2}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:00,851] Trial 76 finished with value: 272.63177075133814 and parameters: {'n_estimators': 345, 'learning_rate': 0.2095150106503206, 'max_depth': 10, 'reg_lambda': 7.352843140606574, 'subsample': 0.4993525712050281, 'colsample_bytree': 0.5678296111168029, 'gamma': 0.1486634801438333, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:01,229] Trial 77 finished with value: 274.3205793942853 and parameters: {'n_estimators': 352, 'learning_rate': 0.263078021990381, 'max_depth': 10, 'reg_lambda': 7.845257048289091, 'subsample': 0.4969711434932156, 'colsample_bytree': 0.6294055515971203, 'gamma': 0.14182710091773962, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:01,878] Trial 78 finished with value: 278.7768751475913 and parameters: {'n_estimators': 332, 'learning_rate': 0.2652613726156818, 'max_depth': 10, 'reg_lambda': 48.43127167247482, 'subsample': 0.4992120706329156, 'colsample_bytree': 0.6259854631386206, 'gamma': 0.16182496492910714, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:02,283] Trial 79 finished with value: 279.3269282804628 and parameters: {'n_estimators': 356, 'learning_rate': 0.24579136108561317, 'max_depth': 10, 'reg_lambda': 10.103744253525749, 'subsample': 0.42090434950837374, 'colsample_bytree': 0.6928602016606208, 'gamma': 0.10044304317745088, 'min_child_weight': 3}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:03,058] Trial 80 finished with value: 277.69627061358625 and parameters: {'n_estimators': 413, 'learning_rate': 0.27798203799959964, 'max_depth': 10, 'reg_lambda': 22.745069322038372, 'subsample': 0.45791378349219974, 'colsample_bytree': 0.7454326633554741, 'gamma': 0.042544109852274734, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:03,492] Trial 81 finished with value: 273.3138566101228 and parameters: {'n_estimators': 185, 'learning_rate': 0.21268464596233164, 'max_depth': 9, 'reg_lambda': 5.777670982375906, 'subsample': 0.47629347247967335, 'colsample_bytree': 0.5681639218994864, 'gamma': 0.1411599347843201, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:03,939] Trial 82 finished with value: 274.84975450645464 and parameters: {'n_estimators': 178, 'learning_rate': 0.21544079030424487, 'max_depth': 9, 'reg_lambda': 6.2634201220314845, 'subsample': 0.49352189672070906, 'colsample_bytree': 0.5582464826168826, 'gamma': 0.14408931798994673, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:04,519] Trial 83 finished with value: 280.9565294395466 and parameters: {'n_estimators': 180, 'learning_rate': 0.2116284669916624, 'max_depth': 9, 'reg_lambda': 63.35506813992003, 'subsample': 0.5037962276182846, 'colsample_bytree': 0.6153697372190052, 'gamma': 0.1433919958208231, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:04,861] Trial 84 finished with value: 278.47587497048175 and parameters: {'n_estimators': 93, 'learning_rate': 0.18249917800062776, 'max_depth': 9, 'reg_lambda': 1.2777079510533818, 'subsample': 0.4890350430784967, 'colsample_bytree': 0.5704969571043933, 'gamma': 0.06744215095494857, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:05,219] Trial 85 finished with value: 280.32431714420653 and parameters: {'n_estimators': 249, 'learning_rate': 0.22138049806343044, 'max_depth': 5, 'reg_lambda': 18.785482335330038, 'subsample': 0.4338373186930039, 'colsample_bytree': 0.5512246110927226, 'gamma': 0.11309111415838193, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:05,498] Trial 86 finished with value: 276.20092957926636 and parameters: {'n_estimators': 156, 'learning_rate': 0.27745835831625093, 'max_depth': 8, 'reg_lambda': 3.8747812301855515, 'subsample': 0.40013898041541524, 'colsample_bytree': 0.5245922972675459, 'gamma': 0.14609357061510322, 'min_child_weight': 1}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:05,720] Trial 87 finished with value: 285.44692862484254 and parameters: {'n_estimators': 77, 'learning_rate': 0.2031602205093676, 'max_depth': 7, 'reg_lambda': 9.407830808645649, 'subsample': 0.43484863334767204, 'colsample_bytree': 0.6080486624078841, 'gamma': 0.02471123226172514, 'min_child_weight': 2}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:06,329] Trial 88 finished with value: 281.3319205565176 and parameters: {'n_estimators': 208, 'learning_rate': 0.1666949035699387, 'max_depth': 8, 'reg_lambda': 30.875704032065002, 'subsample': 0.3768758500117065, 'colsample_bytree': 0.6309724328962547, 'gamma': 0.08366715208850783, 'min_child_weight': 2}. Best is trial 73 with value: 271.38115701747483.
[I 2023-08-24 21:39:06,709] Trial 89 finished with value: 269.2243781486146 and parameters: {'n_estimators': 315, 'learning_rate': 0.29982249490952867, 'max_depth': 9, 'reg_lambda': 3.8209153905781825, 'subsample': 0.4760189439492335, 'colsample_bytree': 0.7098440970336479, 'gamma': 0.209688895363623, 'min_child_weight': 1}. Best is trial 89 with value: 269.2243781486146.
[I 2023-08-24 21:39:07,066] Trial 90 finished with value: 277.6214528888539 and parameters: {'n_estimators': 273, 'learning_rate': 0.29316623131900954, 'max_depth': 9, 'reg_lambda': 1.8704233342146532, 'subsample': 0.5162545925081636, 'colsample_bytree': 0.6659807312716542, 'gamma': 0.21314061070936458, 'min_child_weight': 3}. Best is trial 89 with value: 269.2243781486146.
[I 2023-08-24 21:39:07,445] Trial 91 finished with value: 274.58950060020464 and parameters: {'n_estimators': 313, 'learning_rate': 0.25057031221364373, 'max_depth': 9, 'reg_lambda': 4.929872027258234, 'subsample': 0.4838363148816927, 'colsample_bytree': 0.7063505962722701, 'gamma': 0.12587747764812254, 'min_child_weight': 1}. Best is trial 89 with value: 269.2243781486146.
[I 2023-08-24 21:39:07,897] Trial 92 finished with value: 279.46095594891375 and parameters: {'n_estimators': 315, 'learning_rate': 0.2550774152499502, 'max_depth': 9, 'reg_lambda': 1.0006885247246193, 'subsample': 0.4751517688643813, 'colsample_bytree': 0.7107848457883194, 'gamma': 0.11938031091527934, 'min_child_weight': 1}. Best is trial 89 with value: 269.2243781486146.
[I 2023-08-24 21:39:08,257] Trial 93 finished with value: 268.49106580604536 and parameters: {'n_estimators': 388, 'learning_rate': 0.27591688091279815, 'max_depth': 8, 'reg_lambda': 4.114256239834311, 'subsample': 0.4488691100208403, 'colsample_bytree': 0.6797203555585086, 'gamma': 0.05485185212903983, 'min_child_weight': 1}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:08,611] Trial 94 finished with value: 271.80698869450566 and parameters: {'n_estimators': 351, 'learning_rate': 0.28262009546200967, 'max_depth': 8, 'reg_lambda': 0.3378461141864501, 'subsample': 0.4487850790428043, 'colsample_bytree': 0.6852330136206313, 'gamma': 0.05550074982972157, 'min_child_weight': 1}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:08,938] Trial 95 finished with value: 271.37443546323993 and parameters: {'n_estimators': 405, 'learning_rate': 0.27913303701950304, 'max_depth': 8, 'reg_lambda': 1.5366493622447066, 'subsample': 0.4486042623748286, 'colsample_bytree': 0.676803279951354, 'gamma': 0.054702701999234764, 'min_child_weight': 2}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:09,249] Trial 96 finished with value: 279.8647793017947 and parameters: {'n_estimators': 398, 'learning_rate': 0.2789584395705792, 'max_depth': 7, 'reg_lambda': 0.3492011644777729, 'subsample': 0.39402751076835313, 'colsample_bytree': 0.6795071210550525, 'gamma': 0.05233425116768309, 'min_child_weight': 2}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:09,577] Trial 97 finished with value: 271.273272689704 and parameters: {'n_estimators': 419, 'learning_rate': 0.22936684290306802, 'max_depth': 8, 'reg_lambda': 0.8360523902986643, 'subsample': 0.4141749828009126, 'colsample_bytree': 0.6572846232591825, 'gamma': 0.004692807666286096, 'min_child_weight': 1}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:09,898] Trial 98 finished with value: 275.94337536405857 and parameters: {'n_estimators': 425, 'learning_rate': 0.297868963638918, 'max_depth': 8, 'reg_lambda': 0.3704185670629215, 'subsample': 0.42299890292783365, 'colsample_bytree': 0.6599775577746472, 'gamma': 0.00505308964269518, 'min_child_weight': 1}. Best is trial 93 with value: 268.49106580604536.
[I 2023-08-24 21:39:10,201] Trial 99 finished with value: 280.4633100303054 and parameters: {'n_estimators': 491, 'learning_rate': 0.27784237696610226, 'max_depth': 8, 'reg_lambda': 2.9988365404132584, 'subsample': 0.35183934580358445, 'colsample_bytree': 0.5978341994723186, 'gamma': 0.026413181848724987, 'min_child_weight': 1}. Best is trial 93 with value: 268.49106580604536.
Best Hyperparameters: {'n_estimators': 388, 'learning_rate': 0.27591688091279815, 'max_depth': 8, 'reg_lambda': 4.114256239834311, 'subsample': 0.4488691100208403, 'colsample_bytree': 0.6797203555585086, 'gamma': 0.05485185212903983, 'min_child_weight': 1}
Best Score (MAE): 268.49106580604536

🤖 Pulling it all together

¶

In [ ]:
objCol = ['TruckSID',
 'Engine',
 'Transmission',
 'FrontAxlePosition',
 'FrameRails',
 'Liner',
 'FrontEndExt',
 'Cab',
 'RearAxels',
 'RearSusp',
 'FrontSusp',
 'RearWheels',
 'RearTires',
 'FrontWheels',
 'FrontTires',
 'TagAxle',
 'EngineFamily',
 'TransmissionFamily']

class DropTargets(BaseEstimator, TransformerMixin):
    def __init__(self, targets = ['ActualWeightBack','ActualWeightFront','ActualWeightTotal']):
        self.targets = targets
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        try:
            X = X.drop(self.targets, axis = 1)
        except:
            pass
        return X

class convertCatColumnsToString(BaseEstimator, TransformerMixin):
    def __init__(self, obj_Col = objCol):
        self.obj_Col = obj_Col
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X[self.obj_Col] = X[self.obj_Col].astype('string')
        return X

class replaceSpace(BaseEstimator, TransformerMixin):      
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        all_cols = X.columns
        for col in all_cols:
            try:
                X[col] = X[col].str.replace(' ', '')
            except:
                pass
        return X
    

class replaceDot(BaseEstimator, TransformerMixin):
    def __init__(self, obj_Col = ['EngineFamily']):
        self.cols = obj_Col
           
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        for col in self.cols:
            X[col] = X[col].str.replace('.', '')
        return X
    
# Step 1: Custom transformer to convert string columns to category
class StringToCategory(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        string_cols = X.select_dtypes(include=['string']).columns
        X[string_cols] = X[string_cols].astype('category')
        return X
    
class CheckUpperAndLowerBound(BaseEstimator, TransformerMixin):
    def __init__(self, upper_bound = 344, lower_bound = -164,variable = 'Overhang',replacement_value = 90):
        self.upper_bound = upper_bound
        self.lower_bound = lower_bound
        self.variable = variable
        self.replacement_value = replacement_value
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X.loc[X[self.variable] < self.lower_bound, self.variable] = self.replacement_value
        X.loc[X[self.variable] > self.upper_bound , self.variable] = self.replacement_value
        return X

class DataFrameScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()
        self.columns = None

    def fit(self, X, y=None):
        self.scaler.fit(X, y)
        self.columns = X.columns
        return self

    def transform(self, X):
        X_scaled = self.scaler.transform(X)
        return pd.DataFrame(X_scaled, columns=self.columns)

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.columns = selected_features
        
      
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Engine_Transmission'] = X['Engine'] * X['Transmission']
        X['TransmissionFamily_EngineFamily'] = X['TransmissionFamily'] * X['EngineFamily']
        
        # Polynomial Features for numeric variables:
        X['WheelBase_squared'] = X['WheelBase'] ** 2
        X['Overhang_squared'] = X['Overhang'] ** 2
        
        # Ratio Features:
        X['Front_to_Rear_Wheels'] = X['FrontWheels'] / (X['RearWheels'] + 0.001)  # Add a small number to avoid division by zero
        X['WheelBase_to_Overhang'] = X['WheelBase'] / (X['Overhang'] + 0.001)

        # Aggregated Features for TransmissionFamily and EngineFamily:
        X['avg_WheelBase_per_TransmissionFamily'] = X.groupby('TransmissionFamily')['WheelBase'].transform('mean')
        X['avg_Overhang_per_EngineFamily'] = X.groupby('EngineFamily')['Overhang'].transform('mean')
        
        # Features based on other columns:
        X['sum_WheelBase_per_Engine'] = X.groupby('Engine')['WheelBase'].transform('sum')
        X['max_Overhang_per_Transmission'] = X.groupby('Transmission')['Overhang'].transform('max')
        
        X = X[self.columns]
        return X
from numpy import log1p

class LogTransform(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
            
    def fit(self, X, y=None):
        return self

    def transform(self, X):
            # If specific columns are provided
        if self.columns:
            for col in self.columns:
                X[col] = log1p(X[col])  # Using log1p to also handle 0 values
        else:
                # If no specific columns are provided, apply to the whole DataFrame
            X = log1p(X)
        return X

class DropIDColumn(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
            
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.drop(self.columns, axis = 1)
        return X
In [ ]:
training = pd.read_csv('training.csv',delimiter=';')
target_ActualWeightFront = training['ActualWeightFront']
target_ActualWeightTotal = training['ActualWeightTotal']
training
Out[ ]:
TruckSID ActualWeightFront ActualWeightBack ActualWeightTotal Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
0 31081 11280 8030 19310 1012011 2700028 3690005 249 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933469 9050015 930469 3P1998 101D100 270C25
1 30580 10720 6660 17380 1012011 2700022 3690005 183 68 403012 404002 4070004 5000004 330507 3500004 3700011 9142001 933469 9050031 930821 3P1998 101D100 270C24
2 31518 11040 6230 17270 1012001 2700022 3690005 216 68 403012 404002 4070004 5000001 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C24
3 31816 11210 7430 18640 1012002 2700028 3690005 219 104 403012 404002 4070004 5000002 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
4 30799 11910 7510 19420 1012019 2700028 3690005 231 104 403012 404002 4070004 5000001 330444 3500004 3700011 9142001 933469 9050037 930469 3P1998 101D102 270C25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 34891 10110 9830 19940 1012012 2700024 3690005 210 104 403012 404998 4070004 5000002 3300041 3500003 3700002 9140016 933469 9050015 930469 3P1998 101D100 270C24
2640 25021 11150 6700 17850 1012002 2700028 3690005 210 74 403012 404002 4070004 5000003 330444 3500004 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
2641 33141 10850 7020 17870 1012002 2700028 3690005 222 80 403012 404998 4070004 5000002 330444 3500014 3700002 9140014 933062 9050015 930469 3P1998 101D97 270C25
2642 40311 10380 6850 17230 1012011 2700022 3690005 222 56 403012 404002 4070004 5000001 330444 3500004 3700002 9142003 933469 9052003 930469 3P1998 101D100 270C24
2643 33401 9820 8760 18580 1012011 2700022 3690005 198 104 403012 404998 4070004 5000002 330507 3500004 3700002 9142001 933469 9050037 930469 3P1998 101D100 270C24

2644 rows × 23 columns

Building the Front Model

¶

In [ ]:
prep_fe_pipeline_front = Pipeline([
    ('drop_Targets', DropTargets()),
    ('convert_cat_to_string', convertCatColumnsToString()),
    ('replace_dot', replaceDot()),
    ('replace_space', replaceSpace()),
    ('str_to_cat', StringToCategory()),
    ('upper_lower', CheckUpperAndLowerBound()),
    ('target_encode', ce.TargetEncoder(handle_unknown='value', handle_missing='value')),
    ('scale', DataFrameScaler()),
    ('FE',FeatureEngineer()),
    ])

formattedData = prep_fe_pipeline_front.fit_transform(training,target_ActualWeightFront)
formattedData
Out[ ]:
Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily Engine_Transmission TransmissionFamily_EngineFamily WheelBase_squared Overhang_squared Front_to_Rear_Wheels WheelBase_to_Overhang avg_Overhang_per_EngineFamily sum_WheelBase_per_Engine max_Overhang_per_Transmission
0 -0.775150 1.199613 0.085077 2.620206 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 1.059240 -0.929880 -0.820189 6.865481 0.752953 0.969098 3.016139 -0.089933 -44.907440 2.373981
1 -0.775150 -0.917003 0.085077 -1.545882 -1.391651 0.647233 1.118745 -0.061616 1.193472 -0.427545 0.251651 -0.687423 -0.317981 -0.240839 0.801290 1.816748 -0.15539 -0.774318 -0.944073 0.710815 0.731013 2.389750 1.936692 -2.527878 1.111625 -0.089933 -44.907440 1.244292
2 0.194674 -0.917003 0.085077 0.537162 -1.391651 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 -0.944073 -0.178516 -0.267848 0.288543 1.936692 0.969098 -0.386267 -0.073598 16.114870 1.244292
3 0.270655 1.199613 0.085077 0.726530 0.867729 0.647233 1.118745 -0.061616 -0.946623 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 0.324682 0.300522 0.527846 0.752953 0.969098 0.836314 -0.073598 130.089260 2.373981
4 1.580971 1.199613 0.085077 1.484000 0.867729 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 -0.687423 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 1.551007 1.059240 1.896553 1.642889 2.202257 0.752953 2.702304 1.708244 0.409488 -7.553028 2.373981
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2639 -0.175652 -0.231394 0.085077 0.158427 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -1.712860 -3.511279 0.402826 1.691276 -0.240839 1.224191 -0.359090 -0.15539 -0.774318 -0.944073 0.040645 0.731013 0.025099 0.752953 0.723399 0.182367 -0.089933 0.633708 0.867729
2640 0.270655 1.199613 0.085077 0.158427 -1.015088 0.647233 1.118745 -0.061616 2.089485 0.928864 0.251651 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 0.324682 0.300522 0.025099 1.030403 0.969098 -0.156226 -0.073598 130.089260 2.373981
2641 0.270655 1.199613 0.085077 0.915898 -0.638524 0.647233 -0.906376 -0.061616 -0.946623 0.928864 0.517296 0.402826 1.262227 0.944712 1.224191 -0.359090 -0.15539 0.283715 1.059240 0.324682 0.300522 0.838868 0.407713 0.969098 -1.436647 -0.073598 130.089260 2.373981
2642 -0.775150 -0.917003 0.085077 0.915898 -2.144777 0.647233 1.118745 -0.061616 -0.408809 0.928864 0.251651 0.402826 -0.721544 -0.240839 -1.037440 -0.359090 -0.15539 -0.774318 -0.944073 0.710815 0.731013 0.838868 4.600069 1.439802 -0.427235 -0.089933 -44.907440 1.244292
2643 -0.775150 -0.917003 0.085077 -0.599043 0.867729 0.647233 -0.906376 -0.061616 -0.946623 -0.427545 0.251651 0.402826 -0.317981 -0.240839 -0.856580 -0.359090 -0.15539 -0.774318 -0.944073 0.710815 0.731013 0.358853 0.752953 2.702304 -0.689563 -0.089933 -44.907440 1.244292

2644 rows × 28 columns

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(formattedData, target_ActualWeightFront, test_size=0.3, random_state=42)
In [ ]:
#{'n_estimators': 495, 'learning_rate': 0.133354970082411, 'max_depth': 7, 'reg_lambda': 4.53385562551496e-08, 'subsample': 0.8141582328181602, 'colsample_bytree': 0.7732232759096854, 'gamma': 0.34364076800046645, 'min_child_weight': 1}
model = XGBRegressor(n_estimators=495, learning_rate=0.133354970082411, max_depth=7, reg_lambda=4.53385562551496e-08, subsample=0.8141582328181602, colsample_bytree=0.7732232759096854, gamma=0.34364076800046645, min_child_weight=1, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print("MAE: ", mae)
MAE:  128.6261733509131
In [ ]:
def plotReliabilityGraphs(y_test,y_pred,plot_name):
    residuals = y_test - y_pred
    plt.scatter(y_pred, residuals)
    plt.axhline(0, color='red', linestyle='--')
    plt.xlabel('Predicted')
    plt.ylabel('Residuals')
    plt.title(f'Residual Plot for {plot_name}')
    plt.show()

    plt.scatter(y_test, y_pred)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title(f'Actual vs. Predicted for {plot_name}')
    plt.show()

    plt.hist(residuals, bins=20)
    plt.xlabel('Residuals')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of Residuals for {plot_name}')
    plt.show()

    stats.probplot(residuals, plot=plt)
    plt.show()

plotReliabilityGraphs(y_test,y_pred,'XGBoost without bagging for Front')
In [ ]:
xgb_model = XGBRegressor(n_estimators=495, learning_rate=0.133354970082411, max_depth=7, reg_lambda=4.53385562551496e-08, subsample=0.8141582328181602, colsample_bytree=0.7732232759096854, gamma=0.34364076800046645, min_child_weight=1, random_state=42)

# Wrapping the model within a BaggingRegressor
bagging_model_front = BaggingRegressor(base_estimator=xgb_model, n_estimators=3, random_state=0)
bagging_model_front.fit(X_train, y_train)
y_pred = bagging_model_front.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, y_pred))

plotReliabilityGraphs(y_test,y_pred, 'XGBoost with Bagging Regressor for Front')
MAE:  132.40994224260075

Building the Total Model

¶

In [ ]:
prep_fe_pipeline_total = Pipeline([
    ('drop_Targets', DropTargets()),
    ('convert_cat_to_string', convertCatColumnsToString()),
    ('replace_dot', replaceDot()),
    ('replace_space', replaceSpace()),
    ('str_to_cat', StringToCategory()),
    ('upper_lower', CheckUpperAndLowerBound()),
    ('target_encode', ce.TargetEncoder(handle_unknown='value', handle_missing='value')),
    ('scale', DataFrameScaler()),
    ('DropIdColumn', DropIDColumn(['TruckSID'])),
    ])

formattedData = prep_fe_pipeline_total.fit_transform(training,target_ActualWeightTotal)
X_train, X_test, y_train, y_test = train_test_split(formattedData, target_ActualWeightTotal, test_size=0.3, random_state=42)
X_train
Out[ ]:
Engine Transmission FrontAxlePosition WheelBase Overhang FrameRails Liner FrontEndExt Cab RearAxels RearSusp FrontSusp RearWheels RearTires FrontWheels FrontTires TagAxle EngineFamily TransmissionFamily
1385 -0.680991 -0.776702 0.085077 -0.788411 0.867729 -1.325093 -0.890816 -0.061616 0.923031 0.596275 0.26966 0.690944 0.357058 0.521415 0.195839 0.194185 -0.117407 -0.685647 -0.944073
297 -0.556468 -0.776702 0.085077 0.537162 -1.391651 0.636769 1.128937 -0.061616 0.923031 0.596275 0.26966 0.690944 0.326938 -0.675918 0.985025 0.194185 -0.117407 -0.072367 -0.944073
598 -0.680991 -0.776702 0.085077 0.158427 -1.015088 -1.325093 1.128937 -0.061616 -1.345265 0.308506 0.26966 0.690944 0.357058 0.521415 0.452175 0.538199 -0.117407 -0.685647 -0.944073
1644 -0.145792 1.184255 0.085077 2.809574 -0.638524 0.636769 1.128937 -0.061616 0.691917 0.596275 0.26966 0.690944 1.520929 0.521415 0.985025 0.194185 -0.117407 -0.072367 1.059240
751 -0.145792 1.184255 0.085077 -1.356514 0.867729 0.636769 1.128937 -0.061616 -0.536155 0.596275 0.26966 -1.450352 0.326938 -0.675918 0.985025 0.194185 -0.117407 -0.072367 1.059240
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1638 -0.680991 -0.776702 0.085077 -0.030941 -0.261961 0.636769 1.128937 -0.061616 0.923031 0.308506 0.26966 0.690944 -2.236636 -3.161144 -2.556963 -3.647514 -0.117407 -0.685647 -0.944073
1095 -0.680991 -0.345998 0.085077 0.158427 -0.261961 0.636769 -0.890816 -0.061616 -1.345265 -2.609984 0.26966 0.690944 -0.667824 0.521415 -1.016930 0.194185 -0.117407 -0.685647 1.059240
1130 -0.680991 1.184255 0.085077 0.726530 0.867729 0.636769 -0.890816 -0.061616 -1.345265 0.596275 0.26966 0.690944 -0.967372 -0.675918 -1.016930 0.194185 -0.117407 -0.685647 1.059240
1294 1.928706 1.184255 0.085077 -0.220308 0.867729 0.636769 -0.890816 -0.061616 0.923031 0.596275 0.26966 0.690944 0.357058 0.521415 0.452175 0.538199 -0.117407 1.886329 1.059240
860 -0.145792 1.184255 0.085077 2.809574 -0.638524 0.636769 1.128937 -0.061616 -0.536155 0.596275 0.26966 -1.450352 1.520929 0.521415 0.985025 0.194185 -0.117407 -0.072367 1.059240

1850 rows × 19 columns

In [ ]:
#{'n_estimators': 993, 'learning_rate': 0.24702617076976288, 'max_depth': 8, 'reg_lambda': 1.3184725672621053e-09, 'subsample': 0.7745537544177631, 'colsample_bytree': 0.5084493201928549, 'gamma': 0.9975234538045155, 'min_child_weight': 9}
xgb_model = XGBRegressor(n_estimators=993, learning_rate=0.24702617076976288, max_depth=8, reg_lambda=1.3184725672621053e-09, subsample=0.7745537544177631, colsample_bytree=0.5084493201928549, gamma=0.9975234538045155, min_child_weight=9,random_state=42)

# Wrapping the model within a BaggingRegressor
bagging_model_total = BaggingRegressor(base_estimator=xgb_model, n_estimators=10)
bagging_model_total.fit(X_train, y_train)
y_pred = bagging_model_total.predict(X_test)
print('MAE:', mean_absolute_error(y_test, y_pred))

plotReliabilityGraphs(y_test,y_pred, 'XGBoost with Bagging Regressor for Total')
MAE: 280.8806896449937

📝 Predicting The Test Data

¶

In [ ]:
testData = pd.read_csv('testing.csv',delimiter=';')
test_data_for_total = prep_fe_pipeline_total.transform(testData)
test_data_for_front = prep_fe_pipeline_front.transform(testData)
In [ ]:
total_pred = bagging_model_total.predict(test_data_for_total).astype(int)
front_pred = bagging_model_front.predict(test_data_for_front).astype(int)
back_pred = (total_pred - front_pred).astype(int)
In [ ]:
finalDf = pd.DataFrame({'TruckSID':testData['TruckSID'],
                        'PredictedWeightFront':front_pred.flatten(),
                        'PredictedWeightBack':back_pred.flatten(),
                        'PredictedWeightTotal':total_pred.flatten()})

finalDf.to_csv('finalProduct2.csv', index=False)
In [ ]:
pred = bagging_model_front.predict(test_data_for_front)
In [ ]:
pred
Out[ ]:
array([10919.966 , 12096.512 , 10939.284 , 10506.544 , 11287.884 ,
       10595.5205, 10220.859 , 10477.777 , 12141.856 , 11448.593 ,
       10399.335 , 10952.543 , 10437.177 , 11210.692 , 11105.292 ,
       12421.809 , 10791.581 , 10299.2295, 10771.863 , 10978.629 ,
       10405.955 , 10120.337 , 10497.404 , 10458.731 , 10328.438 ,
       10890.548 , 11443.997 , 10590.777 ,  9879.163 , 11238.274 ,
       11500.628 , 11062.325 , 11210.958 , 10892.074 , 11195.583 ,
       11416.176 , 10432.015 , 10686.352 , 10258.469 , 12125.937 ,
        9755.555 , 10345.53  , 10659.266 , 10118.015 , 11042.6455,
       11104.478 , 12502.105 , 10090.04  , 12294.133 , 10257.591 ,
        9685.571 , 12060.245 , 11984.082 , 10441.053 , 10975.647 ,
       11283.767 , 10399.78  , 11319.277 , 10353.947 , 10884.288 ,
        9793.021 , 10812.937 , 10474.135 , 10733.358 , 11154.466 ,
       10251.309 , 11073.312 , 11319.277 , 10541.645 , 12487.707 ,
        9871.865 , 10528.238 , 10587.112 , 10803.531 , 10679.147 ,
       10249.28  ,  9638.108 , 11986.81  , 10946.532 , 11215.629 ,
       10943.289 , 10054.1455, 12110.292 , 12604.263 , 10950.159 ,
       12242.367 , 12539.137 , 10298.261 , 10553.04  , 11418.55  ,
        9663.673 ,  9863.523 , 10494.723 , 11211.98  , 10089.114 ,
       11275.733 , 10922.423 , 10712.058 , 11950.847 , 10182.765 ,
       10409.296 , 12146.058 , 12217.683 , 11013.109 , 11320.855 ,
       10229.929 , 10440.531 , 12448.56  , 11109.714 , 11375.681 ,
       11615.446 , 11209.7   , 10577.309 , 10113.798 , 10442.873 ,
       12489.892 , 11611.749 , 10500.301 , 10522.988 , 10706.741 ,
       12545.0205, 10215.619 , 11119.649 , 10428.082 , 10362.093 ,
       10645.121 , 10535.516 , 10974.109 , 12127.628 , 10840.29  ,
       10895.418 , 12213.531 , 11933.724 , 10726.531 , 11233.477 ,
       10805.665 , 10932.632 , 10440.915 , 11873.37  , 12364.22  ,
       10298.261 , 10347.611 , 10311.133 , 11188.011 , 11953.847 ,
       11194.606 , 10671.03  , 12331.706 , 11805.839 , 12181.5205,
       10435.081 , 10592.04  , 10910.126 ,  9950.243 , 10828.354 ,
        9984.835 , 10923.172 , 10905.212 , 10414.692 , 11241.73  ,
       10236.226 , 10502.255 , 10712.278 , 11320.047 , 10928.023 ,
       11068.253 , 11258.628 , 10942.898 , 11227.575 , 10922.894 ,
       11518.956 , 11704.18  , 10600.339 , 10816.6875, 10527.824 ,
       10600.856 , 11447.004 , 10762.624 , 11057.544 , 11285.917 ,
       11213.368 , 10565.848 , 10901.038 , 10809.657 , 11371.082 ,
       10580.542 , 10228.4795, 10331.227 , 10476.05  , 10892.074 ,
       10291.181 , 10782.22  , 11014.968 , 11629.813 , 11107.984 ,
       10723.262 , 10009.741 , 11117.778 , 11186.168 , 10380.872 ,
       11450.66  , 10313.824 , 10498.339 , 10348.397 , 11094.718 ,
       10387.672 , 10158.429 ,  9852.298 , 10871.276 , 11212.719 ,
       11339.108 , 10512.33  , 11446.0625, 12448.586 , 10378.921 ,
       10446.622 , 10209.477 ,  9918.968 , 10782.991 , 10241.102 ,
       10139.665 , 11381.718 , 11371.847 , 12090.797 , 11392.059 ,
       10298.512 , 10574.2   , 10864.028 , 10798.855 , 10650.13  ,
       11314.433 , 10383.981 , 10992.247 , 11086.757 , 10309.046 ,
       11155.608 , 10336.306 , 11241.73  , 10457.317 ,  9934.751 ,
       10542.458 , 10970.79  , 10406.591 , 11117.19  , 10726.996 ,
       10842.741 , 11021.902 , 10711.824 , 11575.61  , 10130.535 ,
       10409.489 , 11140.508 , 10753.207 , 11326.642 , 12007.185 ,
       11519.409 , 10925.5205, 12673.227 , 11194.009 , 12001.875 ,
       11935.579 , 10445.171 , 12440.55  , 10652.847 , 11904.636 ,
       11028.758 ,  9766.553 ,  9775.177 , 11951.427 ,  9712.925 ,
       10406.255 , 10347.946 , 10744.247 , 11234.356 , 10858.819 ,
       10855.664 ,  9683.858 , 11018.55  , 11844.402 , 10237.765 ,
       11062.593 , 11238.733 , 10808.125 , 10509.999 , 11254.702 ,
       11094.718 , 11991.761 , 11176.398 ,  9532.988 , 10670.311 ,
       10838.297 , 10321.95  , 12571.164 , 10174.665 , 11234.542 ,
       11201.497 , 10840.671 , 10477.527 , 10479.625 , 11171.995 ,
       10316.495 , 10520.117 , 10162.434 , 10527.025 , 11821.87  ,
       10201.593 , 11948.042 ,  9777.745 , 11234.147 , 10167.571 ,
       11247.284 , 11292.335 , 11319.2295, 10118.669 , 10809.776 ,
        9569.882 , 11105.292 , 10571.884 ,  9731.54  , 10847.536 ,
       10484.281 , 10280.562 , 11964.133 , 10431.091 , 10668.087 ,
       11030.599 , 11197.617 , 10069.817 , 12453.798 , 10493.122 ,
       10407.923 , 11240.601 , 10650.065 , 11719.039 , 11722.56  ,
       11135.184 , 10899.083 , 10105.503 , 10636.351 ,  9620.647 ,
       11832.202 , 10385.669 , 10819.923 , 11911.664 , 12182.831 ,
       10332.53  , 10848.527 , 10072.395 ,  9990.332 ,  9753.204 ,
       10182.733 ,  9824.64  , 11545.544 , 10942.941 ,  9811.895 ,
       11971.077 , 11019.892 , 10638.319 , 11146.005 , 12374.159 ,
       10201.402 , 10479.762 , 11205.86  , 10362.062 , 10532.257 ,
       10139.105 , 10519.307 , 11021.902 , 10997.742 ,  9831.267 ,
       10215.674 , 11321.489 , 11107.711 ,  9772.89  , 10623.058 ,
       12023.945 , 10987.46  , 11042.6455,  9681.184 , 10189.015 ,
       10760.724 , 11616.148 , 10583.976 , 10699.116 , 11532.472 ,
       11795.669 , 10604.569 , 10442.695 , 10337.302 , 10445.481 ,
       10798.185 , 10416.203 , 11002.853 , 11070.677 , 10376.858 ,
       10455.386 ,  9827.785 , 11829.641 , 11952.667 , 10596.542 ,
       10357.839 , 10906.133 , 10930.882 , 10830.777 , 10582.032 ,
       11068.253 , 10639.761 , 10430.18  ,  9986.272 , 12345.976 ,
       10260.767 , 10028.018 , 10890.471 , 10443.397 , 11241.73  ,
       11000.927 ,  9726.772 , 10478.368 , 10325.742 ,  9734.253 ,
       12277.368 , 11228.052 , 10859.292 , 10456.904 , 11120.489 ,
       10265.901 , 11022.301 , 10507.554 , 10870.124 , 10845.883 ,
       11078.046 , 11883.828 , 10628.03  , 10530.761 , 10140.081 ,
       10664.711 , 10202.171 ,  9957.669 , 12463.61  , 12266.022 ,
       12100.003 , 10781.351 , 10212.599 , 10763.234 , 11501.456 ,
       12076.026 , 11861.535 , 11927.992 , 10635.94  , 10420.773 ,
        9851.723 , 11190.356 , 10253.171 , 10436.681 , 10803.178 ,
       10491.449 , 10926.934 , 11071.071 , 10467.522 ,  9662.821 ,
       11088.07  , 10380.595 , 11145.278 , 11214.013 , 11012.378 ,
       10785.871 ,  9470.16  ,  9918.968 , 10197.958 , 10448.28  ,
       12292.165 , 10130.377 , 12054.294 ,  9991.591 , 10776.443 ,
       10250.77  , 10401.868 , 10859.571 , 11191.157 , 10135.177 ,
       10505.474 , 10565.773 ,  9853.741 , 10170.591 , 10634.277 ,
       11844.589 , 10579.923 , 10563.228 , 10339.052 , 10593.448 ,
       11511.59  ,  9761.8125, 10285.814 , 11798.851 , 10457.605 ,
        9586.562 , 10394.813 , 11052.87  , 11048.843 , 10427.5205,
       11446.0625, 10565.501 ,  9683.114 , 12456.698 , 12412.724 ,
       11331.583 , 10603.062 , 10409.11  , 10226.426 , 10190.03  ,
       10253.73  , 10830.573 ,  9768.13  , 10246.091 , 11244.481 ,
        9883.207 , 11124.95  , 10382.316 , 11035.515 , 11140.66  ,
       11193.458 , 10793.0205, 12220.059 , 10590.89  , 10760.504 ,
       11075.352 , 12440.55  , 11933.378 , 10632.737 , 10409.406 ,
       10673.077 , 10393.237 , 10543.015 , 10523.575 , 10997.517 ,
       10428.228 , 10752.324 , 12170.819 , 10595.883 , 10681.974 ,
        9771.753 , 11171.995 , 10758.4795,  9755.555 , 10375.554 ,
       11245.954 , 11053.892 , 11238.274 , 11464.532 , 10865.571 ,
       10989.3125, 10386.331 , 11169.336 , 10334.659 , 11840.231 ,
       11229.935 , 11106.302 , 11153.632 , 10361.114 , 10467.274 ,
       10125.533 , 10545.473 , 11216.897 , 10736.565 , 11477.685 ,
       10145.339 , 10612.435 , 10723.931 , 11127.897 , 10346.552 ,
       11029.719 , 10208.599 , 11256.382 , 10348.397 , 10075.114 ,
       10270.925 , 10699.655 , 11265.269 , 10236.844 , 11916.2   ,
        9792.417 , 10452.243 , 11233.3545, 11245.109 , 10346.577 ,
       10416.522 , 12428.445 , 10533.233 , 10135.177 , 11073.809 ,
       10313.824 , 10611.409 , 11361.7295, 11093.6455, 12470.4375,
        9684.1045, 12004.034 , 11106.302 , 10697.472 , 10557.765 ,
       10799.005 , 11797.204 , 10839.129 , 11759.844 , 11403.724 ,
       10484.587 ,  9839.794 , 10929.0205, 11028.879 ,  9869.619 ,
       11892.719 , 10221.862 , 10307.069 , 11599.585 , 10815.79  ,
       11235.246 , 10389.739 , 11462.776 , 10568.532 , 10600.635 ,
       10915.706 , 12092.575 , 10913.819 , 11153.039 , 10356.06  ,
       10291.077 , 12130.426 , 10218.161 , 10969.405 , 11074.315 ,
       11031.575 ,  9496.825 ,  9863.523 , 10954.483 , 10430.741 ,
       10785.808 , 11256.152 , 11546.942 , 12113.148 , 10177.84  ,
       11251.78  , 10793.695 , 10433.417 , 12539.137 , 12005.831 ,
        9851.672 , 11472.621 , 10451.226 , 11211.019 , 11948.147 ,
       10960.692 , 10236.772 , 10054.004 ,  9853.823 , 11466.215 ,
       11383.43  , 10494.957 , 11036.384 , 10353.695 , 12036.261 ,
       10399.237 , 10863.066 , 11173.993 , 12195.125 , 11134.862 ,
       11214.169 ,  9532.257 , 10218.542 , 10476.782 , 10926.903 ,
       10387.601 , 10041.191 , 11508.375 , 10975.726 , 12194.327 ,
       11001.196 , 10163.793 , 10422.716 , 10196.39  , 11319.277 ,
       11851.106 , 10905.927 , 11001.242 , 12666.985 , 11911.664 ,
       11512.675 , 10269.776 , 11191.667 , 10678.403 , 12491.099 ,
       11123.073 , 10209.746 , 11321.489 , 10726.512 , 12530.016 ,
       11244.452 , 11341.183 , 11714.813 , 11018.446 , 11504.731 ,
       10639.897 , 11847.306 , 10799.57  , 11339.964 , 10847.511 ,
       10263.716 , 10897.173 , 11990.747 , 11287.805 , 10146.464 ,
       11019.785 , 11560.8   , 11165.882 , 12585.551 , 10848.289 ,
       12230.457 , 10434.004 , 10969.949 , 10993.44  ,  9790.655 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10479.762 , 10479.762 , 10479.762 ,
       10479.762 , 10479.762 , 10871.276 , 10479.762 , 10871.276 ,
       10479.762 , 10871.276 , 10479.762 , 10871.276 , 10479.762 ,
       10871.276 , 10479.762 , 10871.276 , 10479.762 , 10871.276 ,
       10479.762 , 10871.276 , 10479.762 , 10871.276 ,  9990.332 ,
       10479.762 , 10871.276 ,  9990.332 , 10479.762 , 10871.276 ,
        9990.332 , 10479.762 , 10871.276 ,  9990.332 , 10479.762 ,
       10871.276 ,  9990.332 , 10479.762 , 10871.276 ,  9990.332 ,
       10479.762 , 10871.276 ,  9990.332 , 10479.762 , 10871.276 ,
        9990.332 , 10479.762 , 10871.276 ,  9990.332 , 10479.762 ,
       10871.276 ,  9990.332 , 10479.762 , 10871.276 ,  9990.332 ,
       10479.762 , 10871.276 ,  9990.332 , 10479.762 , 10871.276 ,
        9990.332 , 10479.762 , 10871.276 ,  9990.332 , 10479.762 ,
       10871.276 ,  9990.332 , 10479.762 , 10871.276 ,  9777.745 ,
        9990.332 , 10479.762 , 10871.276 ,  9777.745 ,  9990.332 ,
       10479.762 , 10871.276 ,  9777.745 ,  9990.332 , 10479.762 ,
       10659.266 , 10871.276 ,  9777.745 , 10668.087 ,  9990.332 ,
       10479.762 , 10659.266 , 10871.276 ,  9777.745 , 10668.087 ,
        9990.332 , 10479.762 , 10659.266 , 10871.276 ,  9777.745 ,
       10668.087 ,  9990.332 , 10479.762 , 10659.266 , 10871.276 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10659.266 , 10871.276 ,  9777.745 , 10668.087 ,  9990.332 ,
       10479.762 ,  9831.267 , 10659.266 , 10871.276 ,  9777.745 ,
       10668.087 ,  9990.332 , 10479.762 ,  9831.267 , 10348.397 ,
       10659.266 , 10871.276 , 11238.733 ,  9777.745 , 10668.087 ,
        9990.332 , 10479.762 ,  9831.267 , 10348.397 , 10659.266 ,
       10871.276 , 11238.733 ,  9777.745 , 10668.087 ,  9990.332 ,
       10479.762 ,  9831.267 , 10348.397 , 10659.266 , 10871.276 ,
       11238.733 ,  9777.745 , 10668.087 ,  9990.332 , 10479.762 ,
        9831.267 , 10348.397 , 10659.266 , 10871.276 , 11238.733 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10348.397 , 10659.266 , 10871.276 , 11238.733 , 11991.761 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10348.397 , 10659.266 , 10871.276 , 11238.733 , 11991.761 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10348.397 , 10659.266 , 10871.276 , 11238.733 , 11991.761 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10348.397 , 10659.266 , 10871.276 , 11238.733 , 11991.761 ,
        9777.745 , 10668.087 ,  9990.332 , 10479.762 ,  9831.267 ,
       10348.397 , 11287.884 , 10659.266 , 10871.276 , 11238.733 ,
       11991.761 ,  9777.745 , 10668.087 ,  9990.332 , 10479.762 ,
        9831.267 , 10348.397 ], dtype=float32)